Disclaimer: Real Data API only extracts publicly available data and maintains a strict policy against collecting any personal or identity-related information.
Use Website Checker to test the selected websites for reliability, anti-scraping algorithms and software, and the expected compute-unit consumption before scraping them. It is available in countries like the USA, UK, UAE, Canada, France, Spain, Germany, Australia, Mexico, Singapore, and more.
It is a simple website crawling tool that scrapes a website's data to check its performance and blocking behavior using Playwright, Puppeteer, or Cheerio.
The website checker has the following features; check them out:
The primary task of the website checker is to measure how often a source website blocks scrapers. Enter the start URL, typically together with a first product or category page. To generate more pages to test, you can either enable enqueueing with pseudoUrls + linkSelector or set replicateStartUrls; both are good ways to exercise various proxy servers (a minimal sketch follows below).
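For illustration, here is a minimal input sketch for the enqueueing route. The linkSelector and pseudoUrls fields are documented in the parameter list below; the { purl: ... } entry shape and the integer form of replicateStartUrls are assumptions, not confirmed by this page.

// A minimal sketch, not a definitive configuration.
const input = {
    urlsToCheck: [{ url: 'https://www.amazon.com/b?ie=UTF8&node=11392907011' }],
    // Option A: discover extra pages to test by following links
    linkSelector: 'a[href]',
    pseudoUrls: [{ purl: 'https://www.amazon.com/[.*]' }], // assumed entry shape
    // Option B (alternative): re-check the start URLs instead of enqueueing
    // replicateStartUrls: 5, // assumed to be an integer count
};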
Pick the run option combinations you want to test; the checker then spawns a scraper run for each proxy group and scraping tool and merges the results into a single dataset.
Ultimately, the checker reports blocking-rate statistics. To verify that it classified each page correctly, review the page screenshots. Detailed results for each URL are available in the dataset and the key-value store.
There is no restriction on how many configurations and websites you can check; the tool can run every configuration against every website. Just set maxConcurrentDomainsChecked so that all parallel runs fit into your total memory: each Playwright/Puppeteer check needs 8 GB and each Cheerio check needs 4 GB. A rough sizing sketch follows.
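As an example, assume a total memory budget of 32 GB (the budget is an assumed figure for illustration; the per-run numbers come from the text above):

// Sizing sketch: per-run memory from the text above, total budget assumed.
const totalMemoryMB = 32 * 1024;   // assumed 32 GB account budget
const playwrightRunMB = 8 * 1024;  // 8 GB per Playwright/Puppeteer check
const cheerioRunMB = 4 * 1024;     // 4 GB per Cheerio check

// 32768 / 8192 = 4 parallel browser checks fit into the budget
const maxConcurrentDomainsChecked = Math.floor(totalMemoryMB / playwrightRunMB);
console.log(maxConcurrentDomainsChecked); // => 4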
Check out the input page of the website checker for more details; almost every input field has a reasonable default.
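For example, the aggregated blocking-rate statistics for a run might look like this: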
{ "timeouted": 0, "failedToLoadOther": 9, "accessDenied": 0, "recaptcha": 0, "distilCaptcha": 24, "hCaptcha": 0, "statusCodes": { "200": 3, "401": 2, "403": 5, "405": 24 }, "success": 3, "total": 43 }
Follow https://api.RealdataAPI.com/v2/key-value-stores/zT3zxpd53Wv9m9ukQ/records/DETAILED-OUTPUT?disableRedirect=true for the detailed output with HTML links, screenshots, and URLs.
Visit the CHANGELOG to see the history of changes.
To run the code examples, you need a RealdataAPI account. Replace <YOUR_API_TOKEN> in the code with your API token.
import { RealdataAPIClient } from 'RealdataAPI-Client';

// Initialize the RealdataAPIClient with API token
const client = new RealdataAPIClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare actor input
const input = {
    "urlsToCheck": [
        {
            "url": "https://www.amazon.com/b?ie=UTF8&node=11392907011"
        }
    ],
    "proxyConfiguration": {
        "useRealdataAPIProxy": true,
        "RealdataAPIProxyGroups": [
            "SHADER",
            "BUYPROXIES94952",
            "RESIDENTIAL"
        ]
    },
    "repeatChecksOnProvidedUrls": 10,
    "maxNumberOfPagesCheckedPerDomain": 1000
};

(async () => {
    // Run the actor and wait for it to finish
    const run = await client.actor("lukaskrivka/website-checker").call(input);

    // Fetch and print actor results from the run's dataset (if any)
    console.log('Results from dataset');
    const { items } = await client.dataset(run.defaultDatasetId).listItems();
    items.forEach((item) => {
        console.dir(item);
    });
})();
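To also pull the detailed per-URL report mentioned above, a sketch like the following could read the DETAILED-OUTPUT record from the run's default key-value store. The keyValueStore(...).getRecord(...) call and the defaultKeyValueStoreId field mirror common client conventions and are assumptions here, not confirmed by this page.

import { RealdataAPIClient } from 'RealdataAPI-Client';

const client = new RealdataAPIClient({ token: '<YOUR_API_TOKEN>' });

(async () => {
    // Shortened input; see the full example above.
    const input = {
        urlsToCheck: [{ url: 'https://www.amazon.com/b?ie=UTF8&node=11392907011' }],
    };

    // Run the actor and wait for it to finish
    const run = await client.actor('lukaskrivka/website-checker').call(input);

    // Read the DETAILED-OUTPUT record (the key-value store record linked above).
    // Both getRecord() and run.defaultKeyValueStoreId are assumed to follow
    // the usual client conventions.
    const record = await client
        .keyValueStore(run.defaultKeyValueStoreId)
        .getRecord('DETAILED-OUTPUT');

    console.log(record && record.value);
})();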
from RealdataAPI_client import RealdataAPIClient

# Initialize the RealdataAPIClient with your API token
client = RealdataAPIClient("<YOUR_API_TOKEN>")

# Prepare the actor input
run_input = {
    "urlsToCheck": [{ "url": "https://www.amazon.com/b?ie=UTF8&node=11392907011" }],
    "proxyConfiguration": {
        "useRealdataAPIProxy": True,
        "RealdataAPIProxyGroups": [
            "SHADER",
            "BUYPROXIES94952",
            "RESIDENTIAL",
        ],
    },
    "repeatChecksOnProvidedUrls": 10,
    "maxNumberOfPagesCheckedPerDomain": 1000,
}

# Run the actor and wait for it to finish
run = client.actor("lukaskrivka/website-checker").call(run_input=run_input)

# Fetch and print actor results from the run's dataset (if there are any)
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)
# Set API token
API_TOKEN=<YOUR_API_TOKEN>

# Prepare actor input
cat > input.json <<'EOF'
{
    "urlsToCheck": [
        {
            "url": "https://www.amazon.com/b?ie=UTF8&node=11392907011"
        }
    ],
    "proxyConfiguration": {
        "useRealdataAPIProxy": true,
        "RealdataAPIProxyGroups": [
            "SHADER",
            "BUYPROXIES94952",
            "RESIDENTIAL"
        ]
    },
    "repeatChecksOnProvidedUrls": 10,
    "maxNumberOfPagesCheckedPerDomain": 1000
}
EOF

# Run the actor
curl "https://api.RealdataAPI.com/v2/acts/lukaskrivka~website-checker/runs?token=$API_TOKEN" \
    -X POST \
    -d @input.json \
    -H 'Content-Type: application/json'
urlsToCheck
Required Array
A static list of URLs to check for captchas and blocking. To add new links on the fly, enable the request-queue option.
proxyConfiguration
Optional Object
Specifies the proxy servers the checker uses to hide its original IP address and check websites successfully.
checkers.puppeteer
Optional Boolean
Crawl websites with Puppeteer.
checkers.cheerio
Optional Boolean
Crawl websites with Cheerio.
checkers.playwright
Optional Boolean
Crawl websites with Playwright.
saveSnapshot
Optional Boolean
It will store HTML + screenshots for Playwright/Puppeteer and HTML for Cheerio.
enqueueAllOnDomain
Optional Boolean
Enqueues all URLs found on the checked domain.
linkSelector
Optional String
A CSS selector matching the links (URLs) that should be followed and added to the request queue. The selector applies only when the request queue is enabled. Use the pseudo-URLs setting below to filter which discovered URLs are actually enqueued.
If the link selector is empty, the checker ignores all links on the page.
pseudoUrls
Optional Array
Specifies which of the links found by the link selector should be added to the request queue. A pseudo-URL is a URL containing regular expressions enclosed in [] brackets (see the sketch after this parameter list). This setting applies only when the request queue option is enabled. If pseudo-URLs are omitted, the checker enqueues every URL matched by the link selector.
repeatChecksOnProvidedUrls
Optional Integer
Checks every provided URL the given number of additional times. This is useful when only the first page tends to be blocked, or when you want to test the same URL repeatedly.
maxNumberOfPagesCheckedPerDomain
Optional Integer
The maximum number of pages the checker loads per domain; once the limit is reached, it stops. Setting this limit is good practice: it prevents excessive consumption of platform credits and storage. Note that the checker may load slightly more pages than the limit.
Setting the limit to zero removes the restriction.
maxConcurrentPagesCheckedPerDomain
Optional Integer
The maximum number of pages processed in parallel for a single domain. The checker scales concurrency automatically based on available system resources; this option sets an upper bound to reduce the load on the checked domain.
maxConcurrentDomainsChecked
Optional Integer
The maximum number of domains checked in parallel. This is only relevant when you pass URLs from several domains at once.
retireBrowserInstanceAfterRequestCount
Optional Integer
Controls how often the browser instance rotates, by retiring it after the given number of requests. Pick a lower number for more frequent rotation at the cost of higher consumption, or a higher number for lower consumption.
navigationTimeoutSecs
Optional Integer
The maximum time in seconds a page is given to load. If the page does not load within this limit, the browser reports an error and the check fails.
puppeteer.headful
Optional Boolean
Runs the browser in headful (visible) mode. Applies only to the Puppeteer checker.
puppeteer.useChrome
Optional Boolean
Uses the full Chrome browser instead of the bundled Chromium. Applies only to the Puppeteer checker; note that Chrome is not guaranteed to work with Puppeteer.
puppeteer.waitFor
Optional String
Applies only to the Puppeteer checker. Makes the checker wait on every page, either for a fixed time or for an element: provide a number of milliseconds or a CSS selector.
puppeteer.memory
Optional Integer
The amount of memory (in megabytes) for the Puppeteer check; choose a power of 2 between 128 and 32768.
playwright.chrome
Optional Boolean
Use Chrome (Chromium) for the check.
playwright.firefox
Optional Boolean
Use Firefox for the check.
playwright.webkit
Optional Boolean
Use WebKit (the Safari engine) for the check.
playwright.useChrome
Optional Boolean
Uses the full Chrome browser instead of the bundled Chromium. Applies only to the Playwright checker; note that Chrome is not guaranteed to work with Playwright.
playwright.headful
Optional Boolean
Runs the browser in headful (visible) mode. Applies only to the Playwright checker.
playwright.waitFor
Optional String
Applies only to the Playwright checker. Makes the checker wait on every page, either for a fixed time or for an element: provide a number of milliseconds or a CSS selector.
playwright.memory
Optional Integer
The amount of memory (in megabytes) for the Playwright check; choose a power of 2 between 128 and 32768.
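For the pseudo-URL bracket syntax mentioned in the parameter list, here is a hedged sketch; example.com is a placeholder domain and the { purl: ... } entry shape is an assumption, not confirmed by this page.

// Illustrative only: placeholder domain, assumed entry shape.
const pseudoUrls = [
    // Matches any path under /product/, e.g. /product/123
    { purl: 'https://www.example.com/product/[.*]' },
];

A complete example input combining the options above: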
{
    "urlsToCheck": [
        {
            "url": "https://www.amazon.com/b?ie=UTF8&node=11392907011"
        }
    ],
    "proxyConfiguration": {
        "useRealdataAPIProxy": true,
        "RealdataAPIProxyGroups": [
            "SHADER",
            "BUYPROXIES94952",
            "RESIDENTIAL"
        ]
    },
    "checkers.cheerio": true,
    "checkers.puppeteer": true,
    "checkers.playwright": true,
    "saveSnapshot": true,
    "enqueueAllOnDomain": true,
    "pseudoUrls": [],
    "repeatChecksOnProvidedUrls": 10,
    "maxNumberOfPagesCheckedPerDomain": 1000,
    "maxConcurrentPagesCheckedPerDomain": 500,
    "maxConcurrentDomainsChecked": 5,
    "retireBrowserInstanceAfterRequestCount": 10,
    "navigationTimeoutSecs": 60,
    "puppeteer.waitFor": "2000",
    "puppeteer.memory": 4096,
    "playwright.chrome": false,
    "playwright.firefox": true,
    "playwright.waitFor": "2000",
    "playwright.memory": 4096
}