Disclaimer: Real Data API only extracts publicly available data while maintaining a strict policy against collecting any personal or identity-related information.
Efficiently crawl websites using headless Chrome and the Puppeteer library with the provided server-side Node.js code. This crawler serves as an alternative to Real Data API/web-scraper, offering greater control over the process. It facilitates recursive crawling, URL lists, and supports website login functionalities.
Puppeteer Scraper stands as a formidable tool in our arsenal, leveraging the Puppeteer library to programmatically control a headless Chrome browser. This versatile scraper can accomplish nearly anything, making it the go-to choice when Web Scraper falls short for your use case. Since Puppeteer is a Node.js library, familiarity with Node.js and its paradigms is essential for effective use. For faster or simpler alternatives, explore Cheerio Scraper for speed or Web Scraper for ease of use.
Refer to our pricing page to determine the average usage cost for this actor. Cheerio Scraper is suited to simple, static HTML pages, whereas Puppeteer Scraper, Web Scraper, and Playwright Scraper can handle any web page. Keep in mind that these cost estimates are averages and may vary based on the complexity and weight of the pages you intend to scrape.
Getting started with Puppeteer Scraper is straightforward. Begin by specifying Start URLs, indicating the web pages the scraper should load. Then define how the scraper should handle each request and extract data from the pages.
The scraper commences by loading pages specified in the Start URLs input setting. To enable dynamic link following, set a Link selector, Glob Patterns, and Pseudo-URLs, instructing the scraper on which links to add to the crawler's request queue. This proves beneficial for the recursive crawling of entire websites, such as discovering all products in an online store.
To guide the scraper on request handling and data extraction, furnish a Page function. Optionally, include arrays of Pre-navigation hooks and Post-navigation hooks. These components consist of JavaScript code executed in the Node.js environment. Leveraging the full-featured Chromium browser, the scraper allows client-side logic execution within the web page context using the page object within the Page function's scope.
In essence, Puppeteer Scraper operates through the following steps:
1. Adding URLs to the request queue: each URL from Start URLs is added to the request queue.
2. For each request, the scraper:
- evaluates all hooks in Pre-navigation hooks;
- executes the Page function on the loaded page;
- optionally identifies links on the page using the Link selector: if a link matches any of the specified Glob Patterns and/or Pseudo-URLs and hasn't been requested yet, it is added to the queue;
- evaluates all hooks in Post-navigation hooks.
3. Repeat or finish: if more items remain in the queue, the scraper repeats step 2; once the queue is empty, the crawl concludes.
Puppeteer Scraper offers various configuration settings to enhance performance, manage cookies for website login, mask the web browser, and more. Refer to the Advanced configuration section below for the comprehensive list of settings.
The actor utilizes a fully-featured Chromium web browser, which can be resource-intensive. This may be excessive for websites that don't dynamically render content using client-side JavaScript. For improved performance on such sites, consider using Cheerio Scraper, which processes raw HTML pages without the overhead of a web browser.
Puppeteer Scraper might be too complex for developers with limited experience. If you prefer a more straightforward setup process, explore Web Scraper, which utilizes Puppeteer but removes some of the complexity.
The Puppeteer Scraper actor accepts various configuration settings on input. These settings can be provided either manually through the user interface in the Real Data API Console or programmatically using a JSON object through the Real Data API. Refer to the actor's input schema outline for a comprehensive list of input fields and their types.
The Start URLs (startUrls) field specifies the initial set of page URLs that the scraper will visit. You can enter these URLs manually, one by one, upload them in a CSV file, or link them from a Google Sheet document. Ensure each URL begins with either the "http://" or "https://" protocol prefix.
The scraper allows dynamic addition of new URLs during scraping. This can be achieved through the Link selector combined with Glob Patterns or Pseudo-URLs, or by calling context.enqueueRequest() within the Page function.
Optionally, you can associate each URL with custom user data—a JSON object referenced from your JavaScript code in the Page function with context.request.userData. This proves useful for distinguishing the currently loaded start URL, enabling page-particular actions. For instance, while crawling an online store, different actions may be desired on a page listing products compared to a product detail page. Refer to the Web scraping tutorial in the Real Data API documentation for more details.
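For illustration, here is a minimal sketch of a Start URLs input with user data; the URLs and labels are hypothetical:

[
    {
        "url": "https://www.example.com/shop",
        "userData": { "label": "LISTING" }
    },
    {
        "url": "https://www.example.com/shop/some-product",
        "userData": { "label": "DETAIL" }
    }
]

Inside the Page function, context.request.userData.label can then be used to branch between listing and detail logic.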
The Link selector (linkSelector) field contains a CSS selector used to find links to other web pages, typically <a> elements with an href attribute
(e.g., <a class="my-class" href="...">)
Upon loading each page, the scraper looks for links matching the Link selector and checks whether each target URL matches one of the specified Glob Patterns or Pseudo-URLs. If there's a match, the URL is added to the request queue for subsequent loading by the scraper.
By default, newly created scrapers feature a selector that matches all links on any page.
a[href]
If the Link selector is left empty, the scraper ignores links on the page and only loads pages listed in Start URLs or those added to the request queue via await context.enqueueRequest() within the Page function.
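For example, here is a minimal sketch of enqueueing a page manually from within the Page function; the URL and label are hypothetical:

await context.enqueueRequest({
    url: 'https://www.example.com/product/123',
    userData: { label: 'DETAIL' }, // Used later to branch inside the Page function.
});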
The Glob Patterns (globs) field specifies which URLs found by the Link selector should be added to the request queue.
Any glob pattern is essentially a string with wildcard characters. Importantly, using the Glob Patterns setting is optional. You retain full control over the pages the scraper accesses by utilizing await context.enqueueRequest() within the Page function.
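For instance, assuming a hypothetical store at example.com, the following glob pattern would match every page under the /shop/ path:

https://www.example.com/shop/**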
The "pseudoUrls" field specifies the types of URLs that the Link selector should add to the request queue. The pseudo-URLs are the URLs that contain different directives given in [] brackets. Presently, the only maintained directive is [regexp] that allows you to define a JavaScript-style regular expression for matching against the URL.
Optionally, you can associate every pseudo-URL having user data, which could be referenced in the Page function with "context.request.label" for determining the type of page currently loaded in the browser.
It's important to note that using the Pseudo-URLs setting is not mandatory. You have the flexibility to control which pages the scraper accesses by using "await context.enqueueRequest()" within the Page function.
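As an example with a hypothetical domain, the pseudo-URL below matches pages such as http://www.example.com/pages/my-awesome-page; the part enclosed in [] brackets is a regular expression:

http://www.example.com/pages/[(\w|-)*]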
For pages where desired links aren't encapsulated within elements featuring href attributes, leverage the Clickable Element Selector. This involves passing a CSS Selector that aligns with elements leading to the targeted URL.
Following the execution of the page function, the scraper will simulate a mouse click on the specified CSS selector. Any ensuing requests, navigations, or opened tabs will be intercepted, and the URLs involved will undergo filtering using Globs and/or Pseudo URLs. Subsequently, these filtered URLs will be incorporated into the request queue. To prevent the scraper from clicking on the page, leave this field empty. It's important to be aware that using this setting may impact performance.
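For instance, if product tiles on a listing page are rendered as plain <div> elements that navigate on click, a selector along these lines could be used (the class name is hypothetical):

div.product-tile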
Within the Page function, the context object has the following structure:
const context = {
    // USEFUL DATA
    input, // Input data in JSON format.
    env, // Contains information about the run, such as actorId and runId.
    customData, // Value of the 'Custom data' scraper option.

    // EXPOSED OBJECTS
    page, // Puppeteer.Page object.
    request, // Crawlee.Request object.
    response, // Response object holding the status code and headers.
    session, // Reference to the currently used session.
    proxyInfo, // Object holding the URL and other information about the currently used proxy.
    crawler, // Reference to the crawler object, with access to `browserPool`, `autoscaledPool`, and more.
    globalStore, // In-memory store that can be used to share data across pageFunction invocations.
    log, // Reference to Crawlee.utils.log.
    Actor, // Reference to the Actor class of the Real Data API SDK.
    RealDataAPI, // Alias to the Actor class for backwards compatibility.

    // EXPOSED FUNCTIONS
    setValue, // Reference to the Actor.setValue() function.
    getValue, // Reference to the Actor.getValue() function.
    saveSnapshot, // Saves a screenshot and full HTML of the current page to the key-value store.
    skipLinks, // Prevents enqueueing more links via Glob Patterns/Pseudo-URLs on the current page.
    enqueueRequest, // Adds a page to the request queue.

    // PUPPETEER CONTEXT-AWARE UTILITY FUNCTIONS
    injectJQuery, // Injects the jQuery library into a Puppeteer page.
    sendRequest, // Sends a request using got-scraping.
    parseWithCheerio, // Returns a Cheerio handle for page.content(), allowing you to work with the data the same way as with CheerioCrawler.
};
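To tie these pieces together, here is a minimal sketch of a Page function; the .price selector is a hypothetical placeholder that you would replace with one matching the target site's markup:

async function pageFunction(context) {
    const { page, request, log } = context;
    log.info(`Scraping ${request.url}`);
    // Read the document title from the live page.
    const title = await page.title();
    // Hypothetical selector; resolve to null when the element is absent.
    const price = await page
        .$eval('.price', (el) => el.textContent.trim())
        .catch(() => null);
    // The returned object becomes one record in the default dataset.
    return { url: request.url, title, price };
}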
The Proxy Configuration (proxyConfiguration) option empowers you to configure proxies for the scraper, mitigating the risk of detection by target websites. This can involve both Real Data API Proxy and custom HTTP or SOCKS5 proxy servers.
Proxy configuration is essential for the scraper to function. Here are the available options for the proxy configuration setting:
Real Data API Proxy Options:
proxyUrls: An array of Real Data API Proxy URLs.
Custom Proxy Options:
useRealDataAPIProxy: A boolean flag indicating whether to use the Real Data API Proxy service.
httpProxyUrl: The URL of a custom HTTP proxy server.
socksProxyUrl: The URL of a custom SOCKS5 proxy server.
Ensure that a suitable proxy configuration is set to facilitate the proper functioning of the scraper.
The scraper offers various configurations for utilizing Real Data API Proxy:
In automatic mode, the scraper loads web pages using Real Data API Proxy and leverages all available proxy groups, selecting the least recently used proxy for a specific hostname to minimize the risk of detection. Proxy group details can be viewed on the Proxy page in the Real Data API Console.
Web pages are loaded using Real Data API Proxy with specific target proxy server groups. Users can specify the desired proxy groups for this configuration.
The scraper can also use a custom list of proxy servers, specified in the format scheme://user:password@host:port. Multiple proxies should be separated by spaces or new lines. The URL scheme can be either http or socks5. The username and password may be omitted if the proxy doesn't require authorization, but the port must always be present.
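As a sketch of that custom proxy list format, with placeholder hosts, ports, and credentials:

http://bob:password@proxy-1.example.com:8000
socks5://proxy-2.example.com:1080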
The Pre-navigation Hooks consist of an array of functions executed before the primary pageFunction runs. Each function receives a context object similar to the one passed into the pageFunction. Additionally, a second "DirectNavigationOptions" object is provided. In this context, Real Data API serves as an alias for the Actor class.
Here are the available options for the DirectNavigationOptions:
timeout: The maximum navigation time in milliseconds.
waitUntil: Specifies when the scraper should consider navigation successful. Options include "load," "domcontentloaded," "networkidle0," or "networkidle2."
These hooks offer a way to execute custom logic or modify navigation options before the main scraping function is initiated.
preNavigationHooks: [
    async ({ id, request, session, proxyInfo, customData, Actor, RealDataAPI }, { timeout, waitUntil, referer }) => {},
]
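For instance, here is a minimal sketch of a hook that adjusts the navigation options before each page load; the values are illustrative:

preNavigationHooks: [
    async ({ request, log }, gotoOptions) => {
        // Give slow pages more time and wait for the network to settle.
        gotoOptions.timeout = 60000;
        gotoOptions.waitUntil = 'networkidle2';
        log.info(`Navigating to ${request.url}`);
    },
]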
For a deeper understanding of Pre-navigation hooks and the PuppeteerHook type, refer to the documentation, which describes the objects passed into these functions in detail. Notably, the available properties are extended with the Actor class (formerly Real Data API) and the customData specific to this scraper, enabling you to leverage advanced configurations effectively.
Post-navigation Hooks represent an array of functions executed after the primary pageFunction concludes. The sole available parameter for these functions is the PuppeteerCrawlingContext object. In this context, properties are extended with Actor (alternatively Real Data API) and customData specific to this scraper. Notably, Real Data API serves as an alias for the Actor class in this scenario.
These hooks offer a convenient way to execute custom logic or perform actions following the completion of the main scraping function. For comprehensive details, refer to the documentation.
postNavigationHooks: [
    async ({ id, request, session, proxyInfo, response, customData, Actor, RealDataAPI }) => {},
]
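For example, a minimal sketch of a hook that records the HTTP status of each completed navigation:

postNavigationHooks: [
    async ({ request, response, log }) => {
        // response may be undefined if the navigation failed.
        if (response) {
            log.info(`Loaded ${request.url} with status ${response.status()}`);
        }
    },
]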
The documentation also describes the PuppeteerHook type and the objects passed into these functions in detail, enabling you to harness their capabilities effectively.
If you leave the storage unnamed, the data within it persists on the Real Data API platform only for the retention period of your plan, after which it expires. Named storages are retained indefinitely.
Utilizing a named storage permits data accumulation across different runs, providing a centralized dataset for enhanced organization and sharing capabilities. Refer to the documentation for further details.
The scraping results, as returned by the Page function, are stored in the default dataset associated with the actor run. These results can be exported to various formats such as JSON, XML, CSV, or Excel.
Each object returned by the Page function corresponds to a record in the dataset, enriched with metadata including the URL of the web page that the results originated from. Explore the documentation for a more detailed understanding.
The complete object stored in the dataset includes metadata fields #error and #debug. Here is an example in JSON format:
{
    "title": "Web Scraping, Data Extraction and Automation - Real Data API",
    "#error": false,
    "#debug": {
        "requestId": "fvwscO2UJLdr10B",
        "url": "https://realdataapi.com",
        "loadedUrl": "https://realdataapi.com/",
        "method": "GET",
        "retryCount": 0,
        "errorMessages": null,
        "statusCode": 200
    }
}
If you want the results in a different format, set the format query parameter to a value such as xml, csv, or html. For more details on datasets, refer to the documentation or the Get dataset items endpoint in the Real Data API reference.
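For illustration only, a CSV export request might look like the sketch below; the base URL and path are assumptions, so consult the Get dataset items endpoint in the Real Data API reference for the authoritative form:

GET <API_BASE>/v2/datasets/[DATASET_ID]/items?format=csv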
By exploring these options, you can efficiently retrieve and manipulate your scraped data.
Check out how industries are using Puppeteer Scraper around the world.
E-commerce & Retail