Disclaimer: Real Data API extracts only publicly available data and maintains a strict policy against collecting any personal or identity-related information.
GPT Scraper extracts and processes data from any website. It applies GPT through OpenAI's API to tasks such as content analysis, sentiment analysis, proofreading, review summarization, and more.
GPT Scraper first loads a webpage with Playwright. It then converts the content to markdown and sends that markdown to GPT together with your instructions. If the content exceeds GPT's limit, the scraper truncates it and records the details of the truncation in the log.
For more control, explore Extended GPT Scraper. This advanced tool lets you choose your preferred GPT model and offers additional features.
To get started with GPT Scraper, specify the pages to scrape through Start URLs, then provide instructions that tell GPT how to handle each page. For example, a basic scraper could load the URL https://news.ycombinator.com/ and instruct GPT to extract information from it.
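A minimal sketch of what such an input might look like when built programmatically. Only startUrls is a field name taken from this page; the instructions key, the shape of each start URL entry, and the prompt text are assumptions for illustration, so check the Actor's input schema for the real names.

```python
import json


def build_input(start_urls, instructions):
    """Build a hypothetical GPT Scraper input object."""
    return {
        # `startUrls` is named in this documentation; the {"url": ...} entry shape is assumed.
        "startUrls": [{"url": url} for url in start_urls],
        # `instructions` is an assumed field name for the GPT prompt.
        "instructions": instructions,
    }


scraper_input = build_input(
    ["https://news.ycombinator.com/"],
    "Extract the title and URL of each post on the page.",
)

# Serialize to the JSON object that would be passed as input.
print(json.dumps(scraper_input, indent=2))
```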
GPT Scraper offers various configuration settings, which can be entered manually through the user interface in Real Data API Console or supplied programmatically as a JSON object using the Real Data API. For a comprehensive list of input fields and their types, refer to the Actor's input schema.
The Start URLs (startUrls) field comprises the initial set of page URLs for the scraper to navigate. You can input multiple URLs via file upload or individually. Additionally, the scraper facilitates adding new URLs dynamically through options like Link selector and Glob patterns.
The Link selector (linkSelector) field is a CSS selector used to locate links to other web pages (items with href attributes like <div class="my-class" href="..." >).
When each page loads, the scraper searches for links matching the Link selector and ensures the target URL aligns with any specified Glob patterns. Upon a match, the URL joins the request queue for subsequent scraping. If the Link selector is left empty, the scraper ignores page links, focusing solely on loading pages specified in Start URLs.
A Glob pattern (globs) specifies which of the URLs found by the Link selector are added to the request queue. It is a string that may contain wildcard characters.
For example, a glob pattern like http://www.example.com/pages/**/* will match URLs such as:
http://www.example.com/pages/something
http://www.example.com/pages/my-awesome-page
http://www.example.com/pages/deeper-level/page
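The matching step can be sketched in a few lines. This is an illustration only: the glob_to_regex and filter_links helpers are hypothetical, and the glob library the scraper actually uses may treat edge cases differently. The sketch assumes that ** spans any number of path segments while * stays within one segment.

```python
import re


def glob_to_regex(glob: str) -> re.Pattern:
    """Translate a simple glob into a compiled regex (assumed semantics, see lead-in)."""
    parts = []
    i = 0
    while i < len(glob):
        if glob.startswith("**/", i):
            parts.append("(?:.*/)?")  # zero or more whole path segments
            i += 3
        elif glob.startswith("**", i):
            parts.append(".*")  # anything, including slashes
            i += 2
        elif glob[i] == "*":
            parts.append("[^/]*")  # anything within a single segment
            i += 1
        elif glob[i] == "?":
            parts.append("[^/]")  # one character within a segment
            i += 1
        else:
            parts.append(re.escape(glob[i]))
            i += 1
    return re.compile("".join(parts))


def filter_links(links, globs):
    """Keep links matching at least one glob, in order, without duplicates."""
    patterns = [glob_to_regex(g) for g in globs]
    seen, queue = set(), []
    for url in links:
        if url not in seen and any(p.fullmatch(url) for p in patterns):
            seen.add(url)
            queue.append(url)
    return queue


found = [
    "http://www.example.com/pages/something",
    "http://www.example.com/pages/deeper-level/page",
    "http://www.example.com/other/page",
    "http://www.example.com/pages/something",  # duplicate, enqueued once
]
print(filter_links(found, ["http://www.example.com/pages/**/*"]))
```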
Optimize GPT Configuration: Use the 'Prompts and Instructions' feature to tell GPT how to handle page content. Provide specific prompts for the task at hand, such as summarizing content or extracting specific information like all sentences containing 'Real Data API Proxy'. You can also instruct GPT to respond with 'skip this page', which allows selective processing of the scraped content.
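One way to act on the 'skip this page' convention is to filter results after the run. The sketch below is a client-side illustration under that assumption; it does not describe how the scraper itself treats the phrase, and the sample records are made up.

```python
SKIP_MARKER = "skip this page"


def keep_result(answer: str) -> bool:
    """Return False when GPT answered with the skip marker (case-insensitive)."""
    return SKIP_MARKER not in answer.strip().lower()


# Hypothetical results, shaped like the scraper output shown on this page.
results = [
    {"url": "https://example.com/a", "answer": "Real Data API Proxy rotates IP addresses."},
    {"url": "https://example.com/b", "answer": "Skip this page"},
]

kept = [item for item in results if keep_result(item["answer"])]
print([item["url"] for item in kept])
```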
The maximum crawling depth determines how many links away from the Start URLs the scraper will follow. It acts as a safeguard against infinite crawling depth when a scraper is misconfigured.
The maximum page count caps the number of pages the scraper will open. Setting it to 0 means the number of pages is unlimited.
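The interaction of the two limits can be sketched as a breadth-first traversal. This is an illustration rather than the scraper's actual implementation: crawl_order is a hypothetical helper, the links_on_page dictionary stands in for real page loads, and counting Start URLs as depth 0 is an assumption.

```python
from collections import deque


def crawl_order(start_urls, links_on_page, max_depth, max_pages=0):
    """Return URLs in the order a breadth-first crawl would open them.

    links_on_page maps a URL to the links found on it.
    max_pages=0 means no page limit.
    """
    unique_starts = list(dict.fromkeys(start_urls))
    queue = deque((url, 0) for url in unique_starts)
    seen = set(unique_starts)
    opened = []
    while queue:
        if max_pages and len(opened) >= max_pages:
            break  # page limit reached
        url, depth = queue.popleft()
        opened.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for link in links_on_page.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return opened


# A tiny made-up site: a links to b and c, b links to d, d links back to a.
site = {"a": ["b", "c"], "b": ["d"], "c": [], "d": ["a"]}
print(crawl_order(["a"], site, max_depth=1))
print(crawl_order(["a"], site, max_depth=5, max_pages=2))
```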
To obtain structured data, use the Schema input option in the GPT scraper. Enable the 'Use JSON schema to format answer' option to have the answer structured as a JSON object, which is stored in the output's jsonAnswer attribute.
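As a rough illustration, a schema for contact details could look like the dictionary below, written in standard JSON Schema style. Whether the Schema input accepts exactly this form is an assumption, and missing_required is a hypothetical helper for checking a jsonAnswer after the run.

```python
import json

# Assumed schema; property names follow the contact-details use case on this page.
contact_schema = {
    "type": "object",
    "properties": {
        "companyEmail": {"type": "string"},
        "companyWeb": {"type": "string"},
        "vatId": {"type": "string"},
    },
    "required": ["companyEmail", "companyWeb"],
}


def missing_required(json_answer, schema):
    """List required keys that are absent from a jsonAnswer (which may be None)."""
    answer = json_answer or {}
    return [key for key in schema.get("required", []) if key not in answer]


print(json.dumps(contact_schema, indent=2))
print(missing_required({"companyEmail": "hello@example.com"}, contact_schema))
```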
The Proxy configuration (proxyConfiguration) feature allows you to configure proxies for the GPT scraper. These proxies, including Real Data API Proxy and custom HTTP or SOCKS5 proxy servers, help prevent detection by target websites.
The GPT model imposes a limit on content handling, known as the maximum token limit. When this limit is reached, scraped content undergoes truncation. For enhanced capability, explore Extended GPT Scraper, allowing the use of more than 4096 tokens.
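The truncation step can be approximated as follows. This is a sketch, not the scraper's code: exact token counts require the model's tokenizer, so the helper assumes roughly four characters per token, a common rule of thumb for English text. The 4096 default comes from the limit mentioned above.

```python
def truncate_to_token_limit(text, max_tokens=4096, chars_per_token=4):
    """Cut text to an approximate token budget.

    Returns (possibly shortened text, whether truncation happened).
    chars_per_token is a rough heuristic, not a real tokenizer.
    """
    max_chars = max_tokens * chars_per_token
    if len(text) <= max_chars:
        return text, False
    return text[:max_chars], True


markdown = "# Title\n" + "word " * 10000
shortened, was_truncated = truncate_to_token_limit(markdown)
if was_truncated:
    # Mirrors the scraper's behavior of reporting truncation in the log.
    print(f"Content truncated from {len(markdown)} to {len(shortened)} characters")
```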
Unveiling GPT Scraper's Hidden Gems
Exploring GPT Scraping Use Cases
Delve into these sample scenarios to kickstart your GPT scraping experiments.
For instance, summarize a page efficiently by providing GPT with a start URL, like https://en.wikipedia.org/wiki/COVID-19_pandemic, and instructing it to generate a concise three-sentence summary.
[
{
"url": "https://en.wikipedia.org/wiki/COVID-19_pandemic",
"answer": "Explore this Wikipedia page for in-depth details on the COVID-19 pandemic, covering epidemiology, symptoms, prevention strategies, history, national responses, and actions by organizations like WHO and UN. The information is well-organized with subsections for easy navigation.",
"jsonAnswer": null
}
]
[
{
"url": "https://blog.RealDataAPI.com/step-by-step-guide-to-scraping-amazon/",
"answer": "Explore this tutorial on web scraping Amazon with Real Data API. Learn about automation, data extraction, product data, API usage, proxies, and more.",
"jsonAnswer": null
}
]
Summarize reviews of movies, games, or products
Start URL:
https://www.imdb.com/title/tt10366206/reviews
Instructions for GPT:
Analyze all user reviews for this movie and summarize the consensus.
Results:
[{
"url": "https://www.imdb.com/title/tt10366206/reviews",
"answer": "User reviews for John Wick: Chapter 4 (2023) highlight its exceptional action scenes and Donnie Yen's performance. While praised for creativity, some noted minor flaws like an anticlimactic ending.",
"jsonAnswer": null
}]
Experiment with extracting contact details from a web page using GPT.
For example, provide GPT with the start URL https://RealDataAPI.com/contact and instruct it to return the contact information as JSON, including attributes such as companyEmail, companyWeb, githubUrl, twitterUrl, vatId, businessId, and bankAccountNumber.
[
{
"url": "https://RealDataAPI.com/contact",
"answer": "Contact RealDataAPI using the following details:\n- Email: hello@RealDataAPI.com\n- Website: https://RealDataAPI.com\n- GitHub: https://github.com/RealDataAPI\n- Twitter: https://twitter.com/RealDataAPI\n- VAT ID: CZ04788290\n- Business ID: 04788290\n- Bank Account Number: CZ0355000000000027434378",
"jsonAnswer": {
"companyEmail": "hello@RealDataAPI.com",
"companyWeb": "https://RealDataAPI.com",
"githubUrl": "https://github.com/RealDataAPI",
"twitterUrl": "https://twitter.com/RealDataAPI",
"vatId": "CZ04788290",
"businessId": "04788290",
"bankAccountNumber": "CZ0355000000000027434378"
}
}
]
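Results like the ones above can be consumed with a few lines of code. The sketch below assumes the output is available as a JSON string in the same shape as these examples; the sample records and the structured_or_text helper are hypothetical.

```python
import json

# Made-up output in the same shape as the examples on this page.
raw_output = """[
  {"url": "https://example.com/contact",
   "answer": "Email: hello@example.com",
   "jsonAnswer": {"companyEmail": "hello@example.com"}},
  {"url": "https://example.com/about",
   "answer": "A short summary.",
   "jsonAnswer": null}
]"""


def structured_or_text(item):
    """Prefer jsonAnswer when present; otherwise fall back to the plain-text answer."""
    structured = item.get("jsonAnswer")
    return structured if structured is not None else item.get("answer")


items = json.loads(raw_output)
for item in items:
    print(item["url"], "->", structured_or_text(item))
```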
For more details, contact Real Data API today!