Website Crawler is a SaaS (Software as a Service) that you can use to crawl and analyze up to 100 pages of a website for free in real time. You can run the crawler as many times as you want, up to the set daily limit. Website Crawler is robust and fast. It can generate JSON or CSV files from the extracted data.
Website Crawler displays vital details of the analysis in a pie chart, so you can quickly find out which areas of your website need optimization. Once you have fixed your site, re-run the crawler to see the latest results. Our charts are updated in real time.
With WebsiteCrawler, you can extract data from websites with just a click of a button. Once our platform crawls your site, your data is available for download instantly in a CSV or JSON file. We also offer an API that supports data retrieval in JSON format in case you want structured data for your project or software. You can configure our platform to scrape information of your choice from web pages.
The visualization of data is just the first step in making your website better. We provide detailed analysis reports so that you can fix your website and improve its search presence. We let users filter data with many conditions: from finding internal redirects to pages with duplicate content, we provide many reports that will help you improve your site.
Broken Links: WebsiteCrawler makes you aware of unreachable internal and external links on your site. This SaaS checks the HTTP status code of each URL on the pages it has analyzed and flags the unreachable ones.
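The idea behind such a check can be sketched in a few lines of Python. This is a minimal illustration using the requests library, not WebsiteCrawler's actual implementation:

```python
# Minimal sketch of a broken-link check using the `requests` library;
# WebsiteCrawler's own implementation is not public.
import requests

def check_links(urls):
    broken = []
    for url in urls:
        try:
            # HEAD keeps the check lightweight; some servers require GET instead.
            resp = requests.head(url, allow_redirects=True, timeout=10)
            if resp.status_code >= 400:
                broken.append((url, resp.status_code))
        except requests.RequestException:
            broken.append((url, None))  # unreachable: DNS failure, timeout, etc.
    return broken

print(check_links(["https://example.com/", "https://example.com/missing"]))
```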
Page speed: This SaaS detects and displays the loading time of the pages it has analyzed. You can filter pages by their loading time, so you can find slow and fast pages in no time.
Duplicate titles, meta tags: Multiple title or meta description tags can confuse search bots, especially those indexing your pages for ranking in search engines. With Website Crawler, you can easily find the pages of a site that have multiple title or meta tags.
Missing Alt Tags: Search bots index images displayed on HTML pages and show them in their image search tools. If an image lacks an alt attribute, it may not rank for search keywords. This SaaS has a missing alt tag report that you can use to find pages with images that lack alt text.
XML Sitemap: This SaaS can generate an XML sitemap for your site with a click of a button. You can exclude URLs from the sitemap, add a priority, or specify a change frequency ("changefreq") for the URLs. If you're using a CMS or a custom-built site that does not have a sitemap, use this feature.
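For reference, a sitemap is plain XML following the sitemaps.org protocol. The Python sketch below shows the kind of file such a generator produces; the tag names (loc, changefreq, priority) come from the protocol, while the helper function itself is purely illustrative:

```python
# Illustrative sketch of sitemap XML output; not WebsiteCrawler's generator.
from xml.sax.saxutils import escape

def build_sitemap(entries):
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    for url, changefreq, priority in entries:
        lines += ['  <url>',
                  f'    <loc>{escape(url)}</loc>',
                  f'    <changefreq>{changefreq}</changefreq>',
                  f'    <priority>{priority}</priority>',
                  '  </url>']
    lines.append('</urlset>')
    return "\n".join(lines)

print(build_sitemap([("https://example.com/", "weekly", "1.0")]))
```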
Export data: You can export/download the data displayed in the reports section to a PDF, CSV, or spreadsheet file with a few clicks. There's also an option to export the entire website's data to a file. Website Crawler can also generate LLM-ready structured data, i.e. a JSON file, from the scraped data with a single click of a button.
JavaScript Crawling: This SaaS can execute JS code on JavaScript-enabled web pages. It can also render JS-heavy sites.
Canonical Link issues: One of the major reasons why pages might not rank despite having good content is improper canonical links. Website Crawler finds invalid canonical links on the pages of your site and displays them.
Pages with/without heading tags: Want to know which pages on your site lack heading tags h1 to h5? Want to find the pages that have short headings or headings containing a specific word? With Website Crawler, it is easy to analyze the h1 to h5 HTML tags used on the pages of a website. You can filter heading tags containing certain words, letters, etc.
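As an illustration of what this kind of heading analysis involves, here is a minimal Python sketch using BeautifulSoup; it shows the general idea, not the platform's own code:

```python
# Collect h1-h5 tags from a page and filter them by a keyword.
from bs4 import BeautifulSoup

html = "<h1>Pricing</h1><h2>Free plan</h2><p>...</p><h3>Paid plan</h3>"
soup = BeautifulSoup(html, "html.parser")

headings = [(tag.name, tag.get_text(strip=True))
            for tag in soup.find_all(["h1", "h2", "h3", "h4", "h5"])]
print(headings)                                         # all headings on the page
print([h for h in headings if "plan" in h[1].lower()])  # filter by a word
```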
The number of internal/external links: This platform displays the number of internal and external links on each page of a website. You can filter the list by link count with just one click of a button.
Thin content: A website's ranking can tank after an algorithm update if it has a lot of pages with thin content. Finding thin content on a site is a breeze with this SaaS.
Fast: WebsiteCrawler.org is fast. It can crawl thousands of pages within a few minutes, and it executes scraping/crawling tasks in the background while you work on other things.
Custom data: You can configure this platform to extract/scrape specific data from the pages of a site. You can see in real time whether the tag you've entered is fetching any data.
Log files: You can see useful data from your access log files with our log file analyzer [beta].
Bulk check spelling mistakes: WebsiteCrawler can check hundreds of articles for spelling mistakes with one click of a button. After identifying the mistakes, it makes you aware of the pages with spelling errors.
Bulk check duplicate content: Our platform can efficiently identify and make you aware of duplicate content on your site. A single click of a button reveals every link on your site whose content matches content on other pages of your site.
This SaaS has been designed and built for:
WebsiteCrawler is a SaaS (Software as a Service) that crawls every link it finds on the domain you enter. It does not overwhelm any server but does the job like a pro.
Enter a reachable, non-redirecting website domain (include the scheme and subdomain, e.g. https:// and www) and the number of URLs you want this SaaS to analyze, then click the submit button. Once the crawler gets into action, you will see the list of URLs that have been analyzed. This list is updated every 2 seconds for paid users or every 10 to 15 seconds for free users. Once the number of links in the list equals the limit you entered, you will see a form with options to log in with your Google account or register a new account. Proceed with the option of your choice to see the dashboard.
WebsiteCrawler can render JS-heavy sites, so it supports every publicly reachable website. It does not automatically fill in and submit forms; it works only with publicly available information on HTML pages.
We have set a daily limit of 100 URLs for free plan users. For paid users, this limit is increased to 1000+. How does this work? WebsiteCrawler keeps a record of the total number of links it has crawled for a domain. Once the daily threshold is reached, entering the domain and limit in the form above and clicking the "crawl my site now" button will show an error.
Although this SaaS supports JS, some pages of a site may be poorly linked. This is where the sitemap crawl feature comes in handy. To make WebsiteCrawler crawl a sitemap, enter the URL of the sitemap in the "xml sitemap" text box available above. Websitecrawler.org will extract each URL from the sitemap file and analyze as many pages as you have asked it to.
Yes. On the settings page of WebsiteCrawler.org, there's a "custom tags" section where you select a project, enter a URL, and specify the tags you want this software to scrape (you must enter a CSS selector, e.g. div > p). Fill out this form and click the submit button. If the selector is valid and you see matched results below the form, it will be added to the list of tags that will be processed.
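To clarify what a selector like div > p matches, here is a minimal Python sketch using BeautifulSoup's select(). WebsiteCrawler's own scraper is not public, so this only illustrates standard CSS selector semantics:

```python
# "div > p" matches <p> elements that are direct children of a <div>.
from bs4 import BeautifulSoup

html = "<div><p>Price: $9</p></div><section><p>ignored</p></section>"
soup = BeautifulSoup(html, "html.parser")

for node in soup.select("div > p"):
    print(node.get_text(strip=True))  # -> Price: $9
```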
WebsiteCrawler lets users download a website's data as a comma-separated values (CSV) or JSON file. The generated JSON file contains a JSON array holding one or more JSON objects. The time taken to download the file depends on the size of the data and your internet connection speed.
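Because the export is a plain JSON array of objects, it is straightforward to consume programmatically. Here is a minimal Python sketch; the file name and field names such as "url" and "title" are hypothetical, so check your own export for the actual keys:

```python
# Read the exported file: a JSON array of objects, one per crawled page.
import json

with open("export.json", encoding="utf-8") as f:  # hypothetical file name
    records = json.load(f)  # a list of dicts

for record in records:
    print(record.get("url"), record.get("title"))  # hypothetical keys
```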
Yes, this platform provides an API through which you can get data in an LLM-ready format as soon as the website's data is in our database. You have to create an API key to use this feature. A few lines of code can integrate WebsiteCrawler with any LLM of your choice, provided it supports JSON data.
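A rough sketch of such an integration in Python is below. The endpoint URL, query parameter, and header name are assumptions for illustration only; consult the API documentation for the real ones:

```python
# Hypothetical sketch: pull crawl data via the API, then hand it to an LLM.
import json
import requests

API_KEY = "your-api-key"                           # created in the dashboard
ENDPOINT = "https://websitecrawler.org/api/data"   # hypothetical URL

resp = requests.get(ENDPOINT,
                    params={"domain": "example.com"},       # assumed parameter
                    headers={"Authorization": f"Bearer {API_KEY}"},  # assumed header
                    timeout=30)
resp.raise_for_status()
pages = resp.json()  # JSON array of page objects

# Feed the structured data to any LLM client that accepts JSON text.
prompt = "Summarize these pages:\n" + json.dumps(pages[:5], indent=2)
```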
The crawl progress should appear within 15 to 20 seconds of clicking the button. If this does not happen, use the sitemap crawl function, i.e. enter the sitemap URL instead of the domain and try again.