Disclaimer : Real Data API only extracts publicly available data while maintaining a strict policy against collecting any personal or identity-related information.
Web Scraper is Real Data API's primary tool for crawling and scraping websites. It crawls arbitrary websites in the Chrome browser and extracts the desired data using a JavaScript page function you supply. The actor supports both lists of URLs and recursive crawling, and it manages concurrency to optimize performance. Scrape the required web data in the USA, UK, UAE, Australia, France, Canada, Germany, Singapore, Spain, Mexico, and other countries.
Web Scraper is an easy-to-use data scraping tool that crawls arbitrary web pages and extracts structured data using a JavaScript program. It uses the Chromium browser to render pages, so it can access dynamic content. The collected data is stored in a downloadable dataset in multiple digestible formats, such as CSV, JSON, or XML. You can set up the web data extractor and run it manually, or run it automatically via our scraping API.
If you lack experience with front-end web development or web data scraping, you can take our web scraping course, available in the documentation. The course walks you through many examples step by step; after completing it, you can come back and continue with the web scraping tool.
You can visit the pricing page to learn about the usage cost of the Web Scraper. The cost projections depend on the size of the pages you scrape; the pricing page provides separate estimates for simple HTML pages and for full web pages.
You need two things to start using this web scraping tool: first, tell the scraper which pages it should crawl, and second, tell it how to extract data from those pages.
The tool starts by crawling the pages specified in the Start URLs input parameter. You can make it follow further links by using the Link selector together with Glob Patterns or Pseudo-URLs, which tell the scraper which links to add to the crawling queue. This way, you can perform recursive crawling and, for example, discover every product in an online store.
To tell the scraper how to extract data from the loaded pages, you provide a page function written in JavaScript. Because the tool uses the full-featured Chromium browser, writing the page function is similar to front-end development, and you can use jQuery and similar client-side libraries.
In short, the Web Scraper works like this: it loads the pages specified in Start URLs, optionally follows links matched by the Link selector and Glob Patterns or Pseudo-URLs, and runs the page function on every loaded page to extract data.
The tool has many settings to configure for performance improvement and to set up cookies to log in to various websites, etc. Check the below input configurations for the list of settings.
We've designed the Web Scraper to be generic and easy to use. If you need more flexibility or advanced performance, it may not be the right choice.
As already mentioned, the scraper uses a full-featured Chromium browser. That may be overkill for websites that don't render their content dynamically with JavaScript. For such websites, you can achieve much better performance with the Cheerio data scraper, which downloads and processes raw HTML pages without the overhead of a browser and then extracts and exports the data.
Since the page function of the web scraping automation tool is executed in the context of the target web page, it only supports client-side JavaScript code. If you want to use server-side libraries or control the Chromium browser through the underlying Puppeteer library, choose the Puppeteer Scraper instead. If you prefer Playwright, explore our Playwright Scraper. For even more control and flexibility, you can develop your own customized actor using Crawlee, our SDK, and Node.js.
The web scraping tool accepts a number of configuration settings in its input fields. You can enter them manually in the Real Data API console or pass them as a JSON object when running the actor. Please check the Input tab to learn more about all input fields and their types.
The run mode permits you to switch between two operating modes of the scraper.
Production mode gives you full performance and control over the scraper. Once you finish developing your tool, switch the scraper to production mode.
While developing your tool, you'll want to inspect the page in the browser to debug your code. You can do that in the scraper's development mode, which lets you use Chrome DevTools and control the browser directly. It also disables page timeouts and restricts concurrency to improve your experience with DevTools. To access DevTools, open the Live View tab. You can configure other debugging options in the advanced configuration section.
The Start URLs field contains the list of initial page links the tool will load. You can enter these links one by one, or, if you want to add many links at once, compile them in a Google Sheet or CSV file and upload it. All URLs must start with the https:// or http:// protocol.
The scraper allows you to add new URLs to scrape on the fly, either using the Glob Patterns, Link selector, or Pseudo-URLs options, or by calling context.enqueueRequest() from the page function.
It is often useful to determine which page the scraper is currently loading so that you can take different actions. For instance, when scraping an online store, you might handle product listing pages differently from product detail pages. To do this, you can associate each URL with custom user data, a JSON object that the page function can read via context.request.userData. Check the tutorial for this scraper in our documentation to learn more.
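For illustration, here is a minimal sketch (with hypothetical URLs and labels) of attaching user data when enqueuing a page and branching on it later in the page function:

async function pageFunction(context) {
    const { request } = context;
    // On a product detail page (enqueued below with a label), extract the data.
    if (request.userData.label === 'DETAIL') {
        return { url: request.url, type: 'detail' };
    }
    // Otherwise assume a listing page: enqueue a detail page with custom user data.
    await context.enqueueRequest({
        url: 'http://www.example.com/product/123',
        userData: { label: 'DETAIL' },
    });
    return null;
}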
The Link selector field contains a CSS selector used to find links to other pages, i.e., elements with the href attribute.
On each loaded page, the tool looks for links matching the Link selector and checks whether each one matches the Glob Patterns or Pseudo-URLs. If it does, the link is added to the request queue so that the URL can be loaded later.
By default, new scrapers are created with the following selector, which matches all links on a page:
a[href]
If the Link selector is empty, the scraper ignores all links on the page and only crawls the URLs listed in Start URLs or added to the request queue via context.enqueueRequest().
It specifies which links found by the Link selector should be added to the request queue. A glob pattern is a simple string with wildcard characters.
For instance, the Glob Pattern http://www.example.com/pages/**/* will match any link under the http://www.example.com/pages/ path.
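As a rough sketch, such a pattern could be supplied in the scraper input like this (shown as a JavaScript object matching the client examples later in this document):

// Illustrative input excerpt: follow only links under /pages/.
const input = {
    linkSelector: 'a[href]',
    globs: [{ glob: 'http://www.example.com/pages/**/*' }],
};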
Remember that you don't have to use the Glob Patterns setting at all, because you can fully control which pages the scraper visits by calling await context.enqueueRequest() from the page function.
It specifies which URLs found by the Link selector should be added to the request queue.
A pseudo-URL is simply a URL with special directives enclosed in [] brackets. Currently, the only supported directive is [regexp], which defines a JavaScript-style regular expression that is matched against the URL.
For instance, the pseudo-URL http://www.example.com/pages/[(\w|-)*] will match any link of the form http://www.example.com/pages/ followed by word characters or dashes.
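Similarly, here is a sketch of supplying this pseudo-URL in the scraper input (the purl key reflects the input shape used by the client examples in this document; treat the exact format as an assumption):

// Illustrative input excerpt: enqueue only /pages/<word-characters-or-dashes> links.
const input = {
    linkSelector: 'a[href]',
    pseudoUrls: [{ purl: 'http://www.example.com/pages/[(\\w|-)*]' }],
};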
If either [ or ] appears in the plain query string, it must be encoded as [\x5B] or [\x5D], respectively. Here is an example pseudo-URL:
http://www.example.com/search?do[\x5B]load[\x5D]=1
It will match with the following link:
http://www.example.com/search?do[load]=1
Alternatively, you can associate each pseudo-URL with user data. The page function can then read it via context.request.label to determine which kind of page the scraper is currently loading in the browser.
Remember that you don't have to use the Pseudo-URLs setting at all, because you can fully control which pages the scraper loads by calling await context.enqueueRequest() from the page function.
It is a JavaScript function that the scraper executes on every page it loads in the Chromium browser. Use it to extract data from the page, manipulate the DOM by clicking elements, add new URLs to the request queue, and otherwise control the scraping.
async function pageFunction(context) {
    // jQuery is handy for finding DOM elements and extracting data from them.
    // To use it, make sure to enable the "Inject jQuery" option.
    const $ = context.jQuery;
    const pageTitle = $('title').first().text();

    // Print some information to actor log
    context.log.info(`URL: ${context.request.url}, TITLE: ${pageTitle}`);

    // Manually add a new page to the scraping queue.
    await context.enqueueRequest({ url: 'http://www.example.com' });

    // Return an object with the data extracted from the page.
    // It will be stored to the resulting dataset.
    return {
        url: context.request.url,
        pageTitle,
    };
}
The page function accepts a single argument, the context object, whose properties are listed in the table below. Since the function is executed in the context of the scraped web page, it can also access the DOM via the window and document global variables.
The return value of the page function is an object, or an array of objects, representing the data extracted from the page. The return value must be stringify-able to JSON, i.e., it can only contain basic data types and no circular references. If you don't want to extract any data from the page, return null or undefined; such records can then be skipped from the clean outputs.
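For illustration, a page function might skip pages with nothing to extract (the .product selector here is hypothetical):

async function pageFunction(context) {
    const $ = context.jQuery;
    // No product on this page: return null so no record is stored.
    if ($('.product').length === 0) {
        return null;
    }
    // Otherwise return the extracted data as a plain, JSON-serializable object.
    return {
        url: context.request.url,
        productName: $('.product h1').first().text(),
    };
}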
The page function supports the JavaScript ES6 syntax and is asynchronous, which means you can use the await keyword to wait for background scraping operations to finish. To learn more about asynchronous functions, check out the Mozilla documentation.
customData: Object
The object contains the custom data from the Custom data input setting. It is useful for passing dynamic parameters to your Web Scraper when running it via API.
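For example, assuming a hypothetical query field is supplied via the Custom data input:

// Inside the page function: read a value passed via the "Custom data" input field.
const { query } = context.customData; // e.g. customData = { "query": "laptops" }
context.log.info(`Scraping with custom query: ${query}`);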
enqueueRequest(request, [options]): AsyncFunction
The request parameter is an object containing the details of the request to enqueue, such as its URL, label, user data, headers, and other properties; the only required property is url. The function adds the request to the request queue. For the complete list of supported properties, see the Request object's constructor in the Crawlee documentation.
The options parameter provides additional options. Currently, only the forefront Boolean flag is supported. If true, the request is added to the beginning of the queue; by default, new requests are added to the end.
Example:
await context.enqueueRequest({ url: 'https://www.example.com' });
await context.enqueueRequest({ url: 'https://www.example.com/first' }, { forefront: true });
env: Object
A map of all relevant values that our platform sets for the scraper run via environment variables. For instance, you can find there the scraper run ID, memory, and timeouts. For the complete list of available values, check out our SDK page and find the RealdataAPIEnv interface.
Example:
console.log(`Actor run ID: ${context.env.actorRunId}`);
getValue(key): AsyncFunction
It gets a value from the default key-value store associated with the scraper run. The key-value store is useful for persisting named data records such as state objects or files. It is equivalent to the Actor.getValue() function of the SDK available on our platform.
To set a value, use the counterpart function context.setValue(key, value).
Example:
const value = await context.getValue('my-key');
console.dir(value);
globalStore: Object
It represents an in-memory store that you can use to share data across page function invocations, such as state variables, API responses, or other data. The object has an interface similar to JavaScript's Map object, with a few differences.
Remember that the stored data is not persisted. If the scraper run restarts or migrates to another server, the contents of globalStore are reset, so never rely on a particular value being present in the store.
Example:
let movies = await context.globalStore.get('cached-movies');
if (!movies) {
    movies = await fetch('http://example.com/movies.json');
    await context.globalStore.set('cached-movies', movies);
}
console.dir(movies);
input: Object
It is an object containing the input configuration of the Web Scraper. Each page function invocation receives a fresh copy of the input object, so changing its properties has no effect.
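A small sketch of inspecting the scraper's own input from inside the page function:

// Read selected fields of the scraper input (a read-only copy).
const { startUrls, maxPagesPerCrawl } = context.input;
context.log.info(`Run started with ${startUrls.length} start URL(s), page limit: ${maxPagesPerCrawl}`);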
jQuery: Function
A reference to the jQuery library, which is handy for DOM traversal, manipulation, querying, and data extraction. It is only available if the Inject jQuery option is enabled.
Typically, jQuery is registered under the $ global variable. However, the page itself may use $ for something else, so to avoid conflicts the scraper does not register jQuery globally; use the context.jQuery property instead.
Example:
const $ = context.jQuery;
const pageTitle = $('title').first().text();
log: Object
It is an object with logging functions, with the same interface as the crawlee.utils.log object from Crawlee. The log messages are written directly to the scraper's run log, which is useful for debugging and monitoring. Note that log.debug() only prints messages if the Debug log input setting is enabled.
Example:
const log = context.log;
log.debug('Debug message', { hello: 'world!' });
log.info('Information message', { all: 'good' });
log.warning('Warning message');
log.error('Error message', { details: 'This is bad!' });
try {
    throw new Error('Not good!');
} catch (e) {
    log.exception(e, 'Exception occurred', { details: 'This is really bad!' });
}
request: Object
It is an object containing information about the currently crawled web page, such as its URL, the number of retries, a unique key, and more. Its properties mirror the Crawlee Request object.
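For example, a short sketch that logs a few of its properties from the page function:

// Log basic information about the currently processed request.
const { url, retryCount, uniqueKey } = context.request;
context.log.info(`Processing ${url} (retry ${retryCount}, unique key: ${uniqueKey})`);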
response: Object
It is an object containing information about the HTTP response from the web server. Currently, it only contains the status and headers properties. Check out the following example:
{
    // HTTP status code
    status: 200,
    // HTTP headers
    headers: {
        'content-type': 'text/html; charset=utf-8',
        'date': 'Wed, 06 Nov 2019 16:01:53 GMT',
        'cache-control': 'no-cache',
        'content-encoding': 'gzip',
    },
}
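For instance, a page function might skip pages that did not return HTTP 200 (a sketch; the early return assumes this code runs inside the page function):

// Inside the page function: skip storing a record for non-200 responses.
if (context.response.status !== 200) {
    context.log.warning(`Unexpected status ${context.response.status} for ${context.request.url}`);
    return null;
}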
saveSnapshot(): AsyncFunction
It saves the full HTML of the current page and a screenshot to the key-value store of the scraper run, under the SNAPSHOT-HTML and SNAPSHOT-SCREENSHOT keys, respectively. It is helpful when debugging the scraping tool.
Remember that each snapshot overwrites the previous one, and saveSnapshot() calls are throttled to at most one call every two seconds to avoid slowing the scraper down and consuming extra resources.
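For example, you might capture a snapshot right before an extraction step that behaves unexpectedly:

// Inside the page function: save the page HTML and a screenshot for later inspection.
await context.saveSnapshot();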
setValue(key, data, options): AsyncFunction
The function is similar to the KeyValueStore.setValue() function in Crawlee. It sets a value in the default key-value store associated with the scraper run. The key-value store is useful for persisting named data records such as state objects or files.
To read the value back, use the counterpart function await context.getValue(key).
Example:
await context.setValue('my-key', { hello: 'world' });
skipLinks(): AsyncFunction
If you call this function, no links from the current page will be added to the request queue, even if they match the Glob Patterns, Link selector, or Pseudo-URLs settings. It lets you programmatically stop recursive crawling when you know there are no links worth following on the current page.
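A minimal sketch (with a hypothetical selector) that stops link recursion on pages without useful links:

// Inside the page function: if there are no product links, don't enqueue anything from this page.
const $ = context.jQuery;
if ($('a.product-link').length === 0) {
    await context.skipLinks();
}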
waitFor(options, task): AsyncFunction
A helper function that waits either for a specified amount of time, for an element matching a CSS selector to appear in the DOM, or for a provided function to return true. It is useful for extracting data from pages with dynamic content, where the content might not yet be present when the page function is called.
The options parameter is an object with the following properties and default values:
{
    // Maximum time to wait
    timeoutMillis: 20000,
    // How often to check if the condition changes
    pollingIntervalMillis: 50,
}
Example:
// Wait for selector
await context.waitFor('.foo');
// Wait for 1 second
await context.waitFor(1000);
// Wait for predicate
await context.waitFor(() => !!document.querySelector('.foo'), { timeoutMillis: 5000 });
It allows you to set up proxy servers that the scraper uses to prevent detection by the source websites. You can use custom HTTP or SOCKS5 proxy servers, or Real Data API proxy servers.
Using proxies is mandatory to run the web scraping tool. You can configure them using the following options:
| Option | Description |
|---|---|
| Apify Proxy (automatic) | The scraper will load all web pages using Apify Proxy in automatic mode. In this mode, the proxy uses all proxy groups available to the user and, for each new web page, automatically selects the proxy that hasn't been used for the longest time for the specific hostname, in order to reduce the chance of detection by the website. You can view the list of available proxy groups on the Proxy page in Apify Console. |
| Apify Proxy (selected groups) | The scraper will load all web pages using Apify Proxy with the selected groups of proxy servers. |
| Custom proxies | The scraper will use a custom list of proxy servers. The proxies must be specified in the scheme://user:password@host:port format; see the proxyUrls field in the example below. |
You can also configure the proxies programmatically when calling the scraper via our API by setting the proxyConfiguration field. It accepts a JSON object in the following format:
{
    // Indicates whether to use Apify Proxy or not.
    "useApifyProxy": Boolean,
    // Array of Apify Proxy groups, only used if "useApifyProxy" is true.
    // If missing or null, Apify Proxy will use the automatic mode.
    "apifyProxyGroups": String[],
    // Array of custom proxy URLs, in "scheme://user:password@host:port" format.
    // If missing or null, custom proxies are not used.
    "proxyUrls": String[],
}
Using the Initial cookies field, you can set cookies that the scraper will use when logging in to the targeted website. Cookies are small text files that the web browser stores on your device; many websites use them to save information about your current login session. By passing the login cookies to the scraper input, you can let it log in to the required website. Learn more about how browser automation tools log in to websites by transferring session cookies in our dedicated tutorial.
Note that cookies have a limited lifetime and expire after a specific duration. You must update them frequently so the scraper can keep logging in to the website. Alternatively, you can use the page function to actively keep the scraper logged in. To learn more, check out our guide on how to log in to websites using Puppeteer.
The web scraping tool expects the Initial cookies field to contain a JSON array with one object per cookie, as shown in the following example.
[ { "name": " ga", "value": "GA1.1.689972112. 1627459041", "domain": ".apify.com", "hostOnly": false, "path": "/", "secure": false, "httpOnly": false, "sameSite": "no_restriction", "session": false, "firstPartyDomain": "", "expirationDate": 1695304183, "storelId": "firefox-default", "id": 1 } ]
Pre-navigation Hooks
It is an array of functions that execute before the main page function runs. These functions receive the same context object that is passed to the page function, plus a second object, DirectNavigationOptions.
You can see the existing options here:
preNavigationHooks: [
    async ({ id, request, session, proxyInfo }, { timeout, waitUntil, referer }) => {}
]
Unlike the Cheerio, Puppeteer, and Playwright Scrapers, the Web Scraper does not expose the Actor object in the hook parameters, because the hooks are executed inside the web browser.
To learn more, check the Puppeteer Scraper documentation on pre-navigation hooks to see how the objects are passed into the functions.
Post-navigation Hooks
It is an array of functions that execute after page navigation completes. The only available parameter is the CrawlingContext object.
postNavigationHooks: [
    async ({ id, request, session, proxyInfo, response }) => {}
]
Unlike the Cheerio, Puppeteer, and Playwright Scrapers, the Web Scraper does not expose the Actor object in the hook parameters, because the hooks are executed inside the web browser.
To learn more, check the Puppeteer Scraper documentation on post-navigation hooks to see how the objects are passed into the functions.
Insert Breakpoint
If the Run mode is set to production, this property has no effect. When it is set to development, it injects a breakpoint at the chosen location on every page the scraper crawls. Code execution stops at the breakpoint until you manually resume it in the DevTools window, which you can access via the Container URL or the Live View tab. You can add additional breakpoints with the debugger; statement inside your page function.
Debug Log
If set to true, the log will include debug messages. You can log your own debug messages from the page function using context.log.debug('message').
Web Browser Log
If set to true, console messages from the scraped pages are included in the actor log. Beware that at high concurrency the log may contain many error messages, warnings, and other low-value messages.
Custom Data
Since the user interface input is fixed, it cannot include extra fields required by particular applications. If you have arbitrary data you want to pass to the scraper, use the Custom data input field in the advanced configuration. Its contents are available to the page function under the customData key of the context object.
Customized Namings
You can set custom names for the storages listed below using the last three options of the advanced configuration.
The scraper retains named storages. If you don't need the data to persist beyond the data retention period of your plan, you can leave the storages unnamed. Using named storages, however, lets you share them across many runs (for instance, instead of a separate dataset for each run, all runs can write to a single named dataset). Check it out here.
Results
The scraping results returned by the page function are stored in the default dataset associated with the actor run, from where you can export them to formats such as JSON, XML, CSV, or Excel. For each object returned by the page function, the Web Scraper pushes one record to the dataset and extends it with metadata such as the URL of the web page the data was extracted from.
Here is an example of a page function that returns the sample object:
{ message: 'Hello world!' }
The object stored in the resulting dataset will look like this:
{ "message": "Hello world!", "#error": false, "#debug": { "requestId": "fvwscO2UJLdr10B", "url": "https://www.example.com/", "loadedUrl": "https://www.example.com/", "method": "GET", "retryCount": 0, "errorMessages": null, "statusCode": 200 } }
You can call the API endpoint Get dataset items to download outputs.
https://api.apify.com/v2/datasets/[DATASET_ID]/items?format=json
Where [DATASET_ID] is the ID of the scraper run's dataset. You can find it in the run object returned when you start the scraper. Optionally, you'll find export URLs for the results in your console account.
Add the query parameter clean=true to the API URL, or select the Clean items option in your console account, to skip the #debug and #error metadata fields from the output and remove empty results.
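For instance, using the JavaScript API client shown later in this document, you can fetch the cleaned items programmatically (the clean option is assumed to behave like the clean=true query parameter):

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: '<YOUR_API_TOKEN>' });

(async () => {
    // Fetch dataset items without the #debug/#error fields and without empty records.
    const { items } = await client.dataset('[DATASET_ID]').listItems({ clean: true });
    console.log(items);
})();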
Extra Resources
Congratulations! You have studied the working process of the Web Scraper. Now, you can also explore the following resources:
Upgrading
You can learn more about minor breaking updates in the V2 in the migration guide.
The V3 added more breaking upgrades. You can explore them in the V3 migration guide.
Breaking changes specific to the scraper.
You should have a Real Data API account to execute the program examples. Replace <YOUR_API_TOKEN> in the code with your API token. For more details about the live APIs, read the Real Data API docs.
import { ApifyClient } from 'apify-client';
// Initialize the ApifyClient with API token
const client = new ApifyClient({
token: '<YOUR_API_TOKEN>',
});
// Prepare actor input
const input = {
"runMode": "DEVELOPMENT",
"startUrls": [
{
"url": "https://crawlee.dev"
}
],
"linkSelector": "a[href]",
"globs": [
{
"glob": "https://crawlee.dev/*/*"
}
],
"pseudoUrls": [],
"pageFunction": // The function accepts a single argument: the "context" object.
// For a complete list of its properties and functions,
// see https://apify.com/apify/web-scraper#page-function
async function pageFunction(context) {
// This statement works as a breakpoint when you're trying to debug your code. Works only with Run mode: DEVELOPMENT!
// debugger;
// jQuery is handy for finding DOM elements and extracting data from them.
// To use it, make sure to enable the "Inject jQuery" option.
const $ = context.jQuery;
const pageTitle = $('title').first().text();
const h1 = $('h1').first().text();
const first_h2 = $('h2').first().text();
const random_text_from_the_page = $('p').first().text();
// Print some information to actor log
context.log.info(`URL: ${context.request.url}, TITLE: ${pageTitle}`);
// Manually add a new page to the queue for scraping.
await context.enqueueRequest({ url: 'http://www.example.com' });
// Return an object with the data extracted from the page.
// It will be stored to the resulting dataset.
return {
url: context.request.url,
pageTitle,
h1,
first_h2,
random_text_from_the_page
};
},
"proxyConfiguration": {
"useApifyProxy": true
},
"initialCookies": [],
"waitUntil": [
"networkidle2"
],
"preNavigationHooks": `// We need to return array of (possibly async) functions here.
// The functions accept two arguments: the "crawlingContext" object
// and "gotoOptions".
[
async (crawlingContext, gotoOptions) => {
// ...
},
]`,
"postNavigationHooks": `// We need to return array of (possibly async) functions here.
// The functions accept a single argument: the "crawlingContext" object.
[
async (crawlingContext) => {
// ...
},
]`,
"breakpointLocation": "NONE",
"customData": {}
};
(async () => {
// Run the actor and wait for it to finish
const run = await client.actor("apify/web-scraper").call(input);
// Fetch and print actor results from the run's dataset (if any)
console.log('Results from dataset');
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
console.dir(item);
});
})();
from apify_client import ApifyClient
# Initialize the ApifyClient with your API token
client = ApifyClient("<YOUR_API_TOKEN>")
# Prepare the actor input
run_input = {
"runMode": "DEVELOPMENT",
"startUrls": [{ "url": "https://crawlee.dev" }],
"linkSelector": "a[href]",
"globs": [{ "glob": "https://crawlee.dev/*/*" }],
"pseudoUrls": [],
"pageFunction": """// The function accepts a single argument: the \"context\" object.
// For a complete list of its properties and functions,
// see https://apify.com/apify/web-scraper#page-function
async function pageFunction(context) {
// This statement works as a breakpoint when you're trying to debug your code. Works only with Run mode: DEVELOPMENT!
// debugger;
// jQuery is handy for finding DOM elements and extracting data from them.
// To use it, make sure to enable the \"Inject jQuery\" option.
const $ = context.jQuery;
const pageTitle = $('title').first().text();
const h1 = $('h1').first().text();
const first_h2 = $('h2').first().text();
const random_text_from_the_page = $('p').first().text();
// Print some information to actor log
context.log.info(`URL: ${context.request.url}, TITLE: ${pageTitle}`);
// Manually add a new page to the queue for scraping.
await context.enqueueRequest({ url: 'http://www.example.com' });
// Return an object with the data extracted from the page.
// It will be stored to the resulting dataset.
return {
url: context.request.url,
pageTitle,
h1,
first_h2,
random_text_from_the_page
};
}""",
"proxyConfiguration": { "useApifyProxy": True },
"initialCookies": [],
"waitUntil": ["networkidle2"],
"preNavigationHooks": """// We need to return array of (possibly async) functions here.
// The functions accept two arguments: the \"crawlingContext\" object
// and \"gotoOptions\".
[
async (crawlingContext, gotoOptions) => {
// ...
},
]
""",
"postNavigationHooks": """// We need to return array of (possibly async) functions here.
// The functions accept a single argument: the \"crawlingContext\" object.
[
async (crawlingContext) => {
// ...
},
]""",
"breakpointLocation": "NONE",
"customData": {},
}
# Run the actor and wait for it to finish
run = client.actor("apify/web-scraper").call(run_input=run_input)
# Fetch and print actor results from the run's dataset (if there are any)
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
print(item)
# Set API token
API_TOKEN=<YOUR_API_TOKEN>
# Prepare actor input
cat > input.json <<'EOF'
{
"runMode": "DEVELOPMENT",
"startUrls": [
{
"url": "https://crawlee.dev"
}
],
"linkSelector": "a[href]",
"globs": [
{
"glob": "https://crawlee.dev/*/*"
}
],
"pseudoUrls": [],
"pageFunction": "// The function accepts a single argument: the \"context\" object.\n// For a complete list of its properties and functions,\n// see https://apify.com/apify/web-scraper#page-function \nasync function pageFunction(context) {\n // This statement works as a breakpoint when you're trying to debug your code. Works only with Run mode: DEVELOPMENT!\n // debugger; \n\n // jQuery is handy for finding DOM elements and extracting data from them.\n // To use it, make sure to enable the \"Inject jQuery\" option.\n const $ = context.jQuery;\n const pageTitle = $('title').first().text();\n const h1 = $('h1').first().text();\n const first_h2 = $('h2').first().text();\n const random_text_from_the_page = $('p').first().text();\n\n\n // Print some information to actor log\n context.log.info(`URL: ${context.request.url}, TITLE: ${pageTitle}`);\n\n // Manually add a new page to the queue for scraping.\n await context.enqueueRequest({ url: 'http://www.example.com' });\n\n // Return an object with the data extracted from the page.\n // It will be stored to the resulting dataset.\n return {\n url: context.request.url,\n pageTitle,\n h1,\n first_h2,\n random_text_from_the_page\n };\n}",
"proxyConfiguration": {
"useApifyProxy": true
},
"initialCookies": [],
"waitUntil": [
"networkidle2"
],
"preNavigationHooks": "// We need to return array of (possibly async) functions here.\n// The functions accept two arguments: the \"crawlingContext\" object\n// and \"gotoOptions\".\n[\n async (crawlingContext, gotoOptions) => {\n // ...\n },\n]\n",
"postNavigationHooks": "// We need to return array of (possibly async) functions here.\n// The functions accept a single argument: the \"crawlingContext\" object.\n[\n async (crawlingContext) => {\n // ...\n },\n]",
"breakpointLocation": "NONE",
"customData": {}
}
EOF
# Run the actor
curl "https://api.apify.com/v2/acts/apify~web-scraper/runs?token=$API_TOKEN" \
-X POST \
-d @input.json \
-H 'Content-Type: application/json'
runMode
Optional Enum
It denotes the operating mode of the scraper. Production mode enables concurrency and timeouts and disables debugging. In development mode, the tool ignores page timeouts, doesn't use sessionPool, opens pages one by one, and enables debugging via Chrome DevTools; open the container link or the Live View tab to access the debugger. Other debugging options can be configured in the advanced configuration section.
Visit the Readme tab for more information.
Options:
startUrls
Required Array
It is the static list of URLs from which the scraper starts collecting data.
Visit the readme tab for more details about start URLs.
keepUrlFragments
Optional Boolean
It indicates whether URL fragments such as http://example.com#fragment should be taken into account when checking whether a URL has already been visited. Typically, URL fragments are used for page navigation and should be ignored, since they don't identify separate pages. However, some single-page websites use them to display different pages; in that case, enable this option.
linkSelector
Optional String
A CSS selector stating which links on the page (i.e., <a> elements with an href attribute) the scraper should follow and add to the request queue. You can use Glob Patterns or Pseudo-URLs to filter which of those links are actually enqueued.
The scraper ignores page links if the Link selector is blank.
Check the link selector section in the Readme tab.
Glob Patterns
Optional Array
The Glob Patterns input parameter specifies which page URLs should be enqueued. If this field is omitted, the tool enqueues every link matched by the Link selector. Combine it with the Link selector to tell the scraper where to look for URLs.
pseudoUrls
Optional Array
It specifies which URLs found by the Link selector should be added to the request queue. A pseudo-URL is a URL with regular-expression directives enclosed in [] brackets, for example:
http://www.example.com/[.*]
If the Pseudo-URLs field is omitted, the actor enqueues every link matched by the Link selector.
To learn more, check the pseudo-URLs section in the Readme tab.
pageFunction
Required String
A JavaScript (ES6) function that the crawler executes for every page loaded in the Chromium browser. Use it to extract data from the page, add new links to the queue, and perform other necessary actions.
Visit Page Function in the Readme tab.
injectJQuery
Optional Boolean
If enabled, the crawler injects the jQuery library into every loaded web page before invoking the page function. Note that, to avoid conflicts with libraries used by the page itself, the scraper does not register the jQuery object ($) in the global namespace; you can only access it via context.jQuery in the page function.
proxyConfiguration
Required Object
It specifies the proxies the scraper uses to hide its identity.
Learn more about proxy configuration in the Readme section.
proxyRotation
Optional Enum
The proxy rotation strategy used with our proxy servers. The recommended setting automatically rotates proxies, picks the best-performing ones from the available pool, and removes unresponsive or blocked ones. Alternatively, you can set the scraper to use a new proxy for every request, or to keep a single proxy for as long as possible until it fails. Note: proxy rotation requires a pool of proxies in your console account, so the rotation settings only produce the expected results if you have enough proxies available.
Options:
sessionPoolName
Optional String
The name may only contain English alphanumeric characters, dashes, and underscores. A session is a representation of a user: it has its own IP address and cookies, which the scraper uses together to emulate a real user. Usage of sessions is controlled by the Proxy rotation option. Providing a session pool name lets sessions be shared across multiple scraper runs. This is useful when you need specific cookies to access the target websites, or when many of your proxies are blocked: instead of trying proxies at random, the scraper reuses the working sessions saved to your account. Remember that if a session is not used again within the same window, its IP lock expires after one day.
initialCookies
Optional Array
A JSON array of cookies that will be set in every opened Chrome browser tab before crawling the page, in the format accepted by Puppeteer's Page.setCookie() function. This option is useful for transferring a logged-in session from an external browser.
useChrome
Optional Boolean
If enabled, the crawler uses the full Chrome browser instead of the Chromium bundled with Puppeteer. This may help bypass certain anti-scraping protections on source websites, but it may also make the scraper less stable.
headless
Optional Boolean
Web browsers run in headless mode by default. You can turn this off to run them in headful mode, which may help with certain anti-scraping protections, although it is slower and more resource-intensive.
ignoreSslErrors
Optional Boolean
If enabled, the scraper ignores all TLS and SSL certificate errors. Note that using this option can pose a security risk.
ignoreCorsAndCsp
Optional Boolean
If enabled, the scraper ignores the Content Security Policy and Cross-Origin Resource Sharing settings of visited pages and requested domains. This lets you make HTTP requests from the page function using XHR or fetch.
downloadMedia
Optional Boolean
If enabled, the scraper downloads media such as fonts, images, sound files, and videos. Disabling it may speed up scraping, but some websites may stop working correctly.
downloadCss
Optional Boolean
If enabled, the crawler downloads CSS stylesheet files. Disabling it may speed up scraping, but some websites may not render or work correctly.
maxRequestRetries
Optional Integer
The maximum number of times the scraper retries loading a web page after an error caused by a page load failure or a page function exception.
If you set the maximum retries to zero, the page is considered failed after the first error.
maxPagesPerCrawl
Optional Integer
The maximum number of pages the scraper will crawl; it stops when the limit is reached. Setting a limit is useful to prevent excessive platform usage by misconfigured scrapers. Note that the actual number of crawled pages may be slightly higher than the set value.
If you set the limit to zero, there is no limit.
maxResultsPerCrawl
Optional Integer
The maximum number of output records the scraper will store in the dataset; it stops when the limit is reached.
If you set the limit to zero, there is no limit.
maxCrawlingDepth
Optional Integer
The maximum number of links away from the Start URLs the scraper will descend. This value provides safety against unlimited crawling depth for misconfigured scrapers. Note that pages added with context.enqueueRequest() in the page function are not subject to the maximum depth constraint.
If you set the limit to zero, there is no limit.
maxConcurrency
Optional Integer
The maximum number of pages the scraper can process in parallel. The tool automatically adjusts the concurrency based on available system resources; this option lets you cap it, for example, to reduce the load on the target web server.
pageLoadTimeoutSecs
Optional Integer
The maximum amount of time (in seconds) the scraper waits for a web page to load. If the page doesn't load within this timeframe, it is considered failed, and the scraper retries it if the maximum request retries limit hasn't been reached.
pageFunctionTimeoutSecs
Optional Integer
The maximum amount of time (in seconds) the scraper allows the page function to run. Limiting this duration ensures that a misbehaving page function cannot stall the scraper.
waitUntil
Optional Array
A JSON array of page event names to wait for before considering a web page fully loaded. The scraper waits for all the listed events to trigger before running the page function. Available events are load, domcontentloaded, networkidle2, and networkidle0.
To learn more, explore the waitUntil option of the page.goto() function in the Puppeteer documentation.
preNavigationHooks
Optional String
The scraper evaluates these asynchronous functions sequentially before navigating to each page. They are useful for setting cookies or other browser properties before navigation. The functions receive crawlingContext and gotoOptions, which the scraper passes to the page.goto() function used for navigation.
postNavigationHooks
Optional String
The scraper evaluates these asynchronous functions sequentially after navigating to each page. They accept only the crawlingContext parameter and are useful for checking that navigation succeeded.
breakpointLocation
Optional Enum
This property has no effect when the Run mode is set to production. When it is set to development, it injects a breakpoint at the chosen location on every crawled page. Code execution stops at the breakpoint until you manually resume it in the DevTools window, which you can access via the Container URL or the Live View tab. You can add additional breakpoints with the debugger; statement inside your page function.
Check out the Readme to learn more about Run mode.
Options:
debugLog
Optional Boolean
If enabled, the scraper log will include debug messages. Beware that it can be quite verbose. To log your own debug messages from the page function, use context.log.debug('message').
browserLog
Optional Boolean
If enabled, the scraper log will include console messages produced by JavaScript executed on the web pages via console.log(). Beware that at high concurrency, the log may contain many error messages, warnings, and other low-value messages.
customData
Optional Object
A JSON object that is passed to the page function as context.customData. This setting is useful when invoking the scraper via API, because it lets you pass arbitrary parameters to your code.
datasetName
Optional String
Dataset ID or Name that you can use to store outputs. If you leave it blank, the scraper will use the default dataset.
keyValueStoreName
Optional String
Name or ID of the key-value store used by the scraper. If you leave it blank, the scraper will use the run's default key-value store.
requestQueueName
Optional String
Name or ID of the request queue used by the scraper to store URLs to crawl. If you leave it blank, the scraper will use the run's default request queue.
{
"runMode": "DEVELOPMENT",
"startUrls": [
{
"url": "https://crawlee.dev"
}
],
"keepUrlFragments": false,
"linkSelector": "a[href]",
"globs": [
{
"glob": "https://crawlee.dev/*/*"
}
],
"pseudoUrls": [],
"pageFunction": "// The function accepts a single argument: the \"context\" object.\n// For a complete list of its properties and functions,\n// see https://apify.com/apify/web-scraper#page-function \nasync function pageFunction(context) {\n // This statement works as a breakpoint when you're trying to debug your code. Works only with Run mode: DEVELOPMENT!\n // debugger; \n\n // jQuery is handy for finding DOM elements and extracting data from them.\n // To use it, make sure to enable the \"Inject jQuery\" option.\n const $ = context.jQuery;\n const pageTitle = $('title').first().text();\n const h1 = $('h1').first().text();\n const first_h2 = $('h2').first().text();\n const random_text_from_the_page = $('p').first().text();\n\n\n // Print some information to actor log\n context.log.info(`URL: ${context.request.url}, TITLE: ${pageTitle}`);\n\n // Manually add a new page to the queue for scraping.\n await context.enqueueRequest({ url: 'http://www.example.com' });\n\n // Return an object with the data extracted from the page.\n // It will be stored to the resulting dataset.\n return {\n url: context.request.url,\n pageTitle,\n h1,\n first_h2,\n random_text_from_the_page\n };\n}",
"injectJQuery": true,
"proxyConfiguration": {
"useApifyProxy": true
},
"proxyRotation": "RECOMMENDED",
"initialCookies": [],
"useChrome": false,
"headless": true,
"ignoreSslErrors": false,
"ignoreCorsAndCsp": false,
"downloadMedia": true,
"downloadCss": true,
"maxRequestRetries": 3,
"maxPagesPerCrawl": 0,
"maxResultsPerCrawl": 0,
"maxCrawlingDepth": 0,
"maxConcurrency": 50,
"pageLoadTimeoutSecs": 60,
"pageFunctionTimeoutSecs": 60,
"waitUntil": [
"networkidle2"
],
"preNavigationHooks": "// We need to return array of (possibly async) functions here.\n// The functions accept two arguments: the \"crawlingContext\" object\n// and \"gotoOptions\".\n[\n async (crawlingContext, gotoOptions) => {\n // ...\n },\n]\n",
"postNavigationHooks": "// We need to return array of (possibly async) functions here.\n// The functions accept a single argument: the \"crawlingContext\" object.\n[\n async (crawlingContext) => {\n // ...\n },\n]",
"breakpointLocation": "NONE",
"debugLog": false,
"browserLog": false,
"customData": {}
}