Disclaimer: Real Data API extracts only publicly available data and maintains a strict policy against collecting any personal or identity-related information.
Scrape Twitter data about users, including user profiles, follower counts, followings, hashtags, tweets, retweets, threads, images, statistics, videos, history, replies, and other data fields using Twitter Data Scraper. Our Twitter scraper is accessible in multiple countries, including Canada, France, Australia, Germany, the USA, the UK, Spain, etc.
Twitter Scraper loads the Twitter URLs and profiles you specify and scrapes the data described below.
Twitter Scraper on our platform allows you to scrape Twitter data at scale. It can also collect more data than the official Twitter API because it needs no registered application, Twitter account, or API key, and it imposes no rate restrictions.
You can feed the scraper a list of Twitter handles or use Twitter links such as trending topics, searches, or hashtags.
Crawling the Twitter platform gives you access to over five hundred million tweets posted daily. You can collect the data you need in multiple ways.
To learn more about using this Twitter Scraper, check out our stepwise tutorial or watch the video.
Yes, you can extract publicly available data from Twitter. Note, however, that your output may contain personal data. GDPR and similar regulations worldwide protect personal data and do not allow you to extract personal information without a legitimate reason or prior permission. Consult your lawyers if you are unsure whether your reason qualifies as legitimate.
If you wish to extract specific Twitter data quickly, try the targeted Twitter data scraper options below.
By default, the scraper extracts data using search queries, but you can also supply Twitter URLs or Twitter handles. If you plan to use the URL option, check the allowable URL types shown below.
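For illustration, the scraper typically accepts links like the following (the profile and status URLs reuse examples from the sample output later in this document; the hashtag and search links follow standard Twitter URL patterns):
https://twitter.com/elonmusk (a profile)
https://twitter.com/elonmusk/with_replies (a profile, including replies)
https://twitter.com/hashtag/dataanalytics (a hashtag)
https://twitter.com/search?q=data%20analytics&f=live (a search)
https://twitter.com/elonmusk/status/1633026246937546752 (a single thread/status)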
The log-in-with-cookies option lets the scraper reuse session cookies from an already logged-in user. With this option enabled, the scraper tries to avoid being blocked by the source platform; for example, it reduces its running speed and introduces a random delay between two actions.
We strongly recommend that you do not use your personal account to run the scraper unless there is no other option. Instead, create a new Twitter account so that Twitter won't ban your personal one.
Use a Chrome browser extension such as EditThisCookie to log in with existing cookies. Once you install it, open Twitter in your browser, log in with your credentials, and export the cookies using the extension. It will give you a cookie array to use as the value of the login cookies input when starting the scraper.
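For orientation, the exported array is a list of cookie objects. A minimal sketch of the shape (the values are placeholders, and the exact fields depend on the extension's export) looks like this:
[
    {
        "name": "auth_token", // illustrative cookie name; use whatever the export actually contains
        "value": "<YOUR_COOKIE_VALUE>",
        "domain": ".twitter.com",
        "path": "/",
        "secure": true,
        "httpOnly": true
    }
]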
If you log out of the Twitter account whose cookies you submitted, Twitter will invalidate them and the scraper will stop its execution.
Check out the video tutorial below to sort it out.
Here are the input parameters for Twitter Scraper API.
You can export the scraped dataset in multiple digestible formats such as CSV, JSON, Excel, or HTML. Every item in the scraped dataset contains a different tweet in the following format.
[{
"user": {
"protected": false,
"created_at": "2009-06-02T20:12:29.000Z",
"default_profile_image": false,
"description": "",
"fast_followers_count": 0,
"favourites_count": 19158,
"followers_count": 130769125,
"friends_count": 183,
"has_custom_timelines": true,
"is_translator": false,
"listed_count": 117751,
"location": "",
"media_count": 1435,
"name": "Elon Musk",
"normal_followers_count": 130769125,
"possibly_sensitive": false,
"profile_banner_url": "https://pbs.twimg.com/profile_banners/44196397/1576183471",
"profile_image_url_https": "https://pbs.twimg.com/profile_images/1590968738358079488/IY9Gx6Ok_normal.jpg",
"screen_name": "elonmusk",
"statuses_count": 23422,
"translator_type": "none",
"verified": true,
"withheld_in_countries": [],
"id_str": "44196397"
},
"id": "1633026246937546752",
"conversation_id": "1632363525405392896",
"full_text": "@MarkChangizi Sweden’s steadfastness was incredible!",
"reply_count": 243,
"retweet_count": 170,
"favorite_count": 1828,
"hashtags": [],
"symbols": [],
"user_mentions": [
{
"id_str": "49445813",
"name": "Mark Changizi",
"screen_name": "MarkChangizi"
}
],
"urls": [],
"media": [],
"url": "https://twitter.com/elonmusk/status/1633026246937546752",
"created_at": "2023-03-07T08:46:12.000Z",
"is_quote_tweet": false,
"replying_to_tweet": "https://twitter.com/MarkChangizi/status/1632363525405392896",
"startUrl": "https://twitter.com/elonmusk/with_replies"
},
{
"user": {
"protected": false,
"created_at": "2009-06-02T20:12:29.000Z",
"default_profile_image": false,
"description": "",
"fast_followers_count": 0,
"favourites_count": 19158,
"followers_count": 130769125,
"friends_count": 183,
"has_custom_timelines": true,
"is_translator": false,
"listed_count": 117751,
"location": "",
"media_count": 1435,
"name": "Elon Musk",
"normal_followers_count": 130769125,
"possibly_sensitive": false,
"profile_banner_url": "https://pbs.twimg.com/profile_banners/44196397/1576183471",
"profile_image_url_https": "https://pbs.twimg.com/profile_images/1590968738358079488/IY9Gx6Ok_normal.jpg",
"screen_name": "elonmusk",
"statuses_count": 23422,
"translator_type": "none",
"verified": true,
"withheld_in_countries": [],
"id_str": "44196397"
},
"id": "1633021151197954048",
"conversation_id": "1632930485281120256",
"full_text": "@greg_price11 @Liz_Cheney @AdamKinzinger @RepAdamSchiff Besides misleading the public, they withheld evidence for partisan political reasons that sent people to prison for far more serious crimes than they committed./n/nThat is deeply wrong, legally and morally.",
"reply_count": 727,
"retweet_count": 2458,
"favorite_count": 10780,
"hashtags": [],
"symbols": [],
"user_mentions": [
{
"id_str": "896466491587080194",
"name": "Greg Price",
"screen_name": "greg_price11"
},
{
"id_str": "98471035",
"name": "Liz Cheney",
"screen_name": "Liz_Cheney"
},
{
"id_str": "18004222",
"name": "Adam Kinzinger #fella",
"screen_name": "AdamKinzinger"
},
{
"id_str": "29501253",
"name": "Adam Schiff",
"screen_name": "RepAdamSchiff"
}
],
"urls": [],
"media": [],
"url": "https://twitter.com/elonmusk/status/1633021151197954048",
"created_at": "2023-03-07T08:25:57.000Z",
"is_quote_tweet": false,
"replying_to_tweet": "https://twitter.com/greg_price11/status/1632930485281120256",
"startUrl": "https://twitter.com/elonmusk/with_replies"
}]
...
Use a pre-built Advanced Search URL as a start link, for example: https://twitter.com/search?q=cool%20until%3A2021-01-01&src=typed_query
Twitter returns a maximum of 3,200 tweets per search or profile. If you need more tweets than this limit, split your start links into time slices, as in the URL samples below (a script that generates such slices follows the examples).
https://twitter.com/search?q=(from%3Aelonmusk)%20since%3A2020-03-01%20until%3A2020-04-01&src=typed_query&f=live
https://twitter.com/search?q=(from%3Aelonmusk)%20since%3A2020-02-01%20until%3A2020-03-01&src=typed_query&f=live
https://twitter.com/search?q=(from%3Aelonmusk)%20since%3A2020-01-01%20until%3A2020-02-01&src=typed_query&f=live
Each link is for the same account, Elon Musk, but we separated them into 30-day monthly time frames: January, February, and March 2020. You can create such links using Twitter's advanced search at https://twitter.com/search
If you want, you can use larger time intervals for accounts that don't post regularly.
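As a minimal sketch in plain JavaScript (no scraper-specific APIs assumed), you can generate such monthly slices programmatically; the handle and months below are illustrative:
// Build monthly time-sliced Twitter search URLs for one account.
const handle = 'elonmusk';
const months = [
    ['2020-01-01', '2020-02-01'],
    ['2020-02-01', '2020-03-01'],
    ['2020-03-01', '2020-04-01'],
];

const startUrls = months.map(([since, until]) => {
    // encodeURIComponent leaves parentheses intact and encodes spaces/colons,
    // matching the URL samples above
    const q = encodeURIComponent(`(from:${handle}) since:${since} until:${until}`);
    return `https://twitter.com/search?q=${q}&src=typed_query&f=live`;
});

console.log(startUrls.join('\n'));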
The extend output function lets you change the shape of your dataset output, split data arrays into separate items, or filter the output:
async ({ data, item, request }) => {
item.user = undefined; // removes this field from the output
delete item.user; // this works as well
const raw = data.tweets[item['#sort_index']]; // allows you to access the raw data
item.source = raw.source; // adds "Twitter for ..." to the output
if (request.userData.search) {
item.search = request.userData.search; // add the search term to the output
item.searchUrl = request.loadedUrl; // add the raw search URL to the output
}
return item;
}
Item filtering:
async ({ item }) => {
if (!item.full_text.includes('lovely')) {
return null; // omit the output if the tweet body doesn't contain the text
}
return item;
}
Separating into multiple data items and changing the entire result:
async ({ item }) => {
// dataset will be full of items like { hashtag: '#somehashtag' }
// returning an array here will split in multiple dataset items
return item.hashtags.map((hashtag) => {
return { hashtag: `#${hashtag}` };
});
}
This function lets you extend what the scraper does, and it can simplify extending the default scraper behavior without maintaining a custom version. For instance, you can include a trending-topic search on every page visit:
async ({ page, request, addSearch, addProfile, addThread, customData }) => {
await page.waitForSelector('[aria-label="Timeline: Trending now"] [data-testid="trend"]');
const trending = await page.evaluate(() => {
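// note: $ below is jQuery, which we assume the scraper injects into the page context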
const trendingEls = $('[aria-label="Timeline: Trending now"] [data-testid="trend"]');
return trendingEls.map((_, el) => {
return {
term: $(el).find('> div > div:nth-child(2)').text().trim(),
profiles: $(el).find('> div > div:nth-child(3) [role="link"]').map((_, el) => $(el).text()).get()
}
}).get();
});
for (const { term, profiles } of trending) {
await addSearch(term); // add a search using its text
for (const profile of profiles) {
await addProfile(profile); // adds a profile using link
}
}
// adds a thread and gets its replies; accepts an ID (e.g., from conversation_id) or a URL
// you can call this multiple times, but each thread will be added only once
await addThread("1351044768030142464");
}
extendScraperFunction also receives additional variables:
async ({ label, response, url }) => {
if (label === 'response' && response) {
// inside the page.on('response') callback
if (url.includes('live_pipeline')) {
// deal with plain text content
const text = await (await response.blob()).text(); // read the response body as plain text
}
} else if (label === 'before') {
// executes before the page.on('response'), can be used for intercept request/response
} else if (label === 'after') {
// executes after the scraping process has finished, even on crash
}
}
Lastly, using Real Data API Integrations, you can connect Twitter Scraper with almost any web application or cloud service, including Google Drive, Google Sheets, Airbyte, Make, Slack, GitHub, Zapier, and more. You can also use webhooks to carry out an action when an event occurs, such as getting an alert when Twitter Scraper finishes its run.
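As a purely illustrative sketch, a webhook of this kind typically pairs an event type with a target URL. The field and event names below are assumptions for illustration, not confirmed Real Data API identifiers; check the platform's webhook documentation for the real schema:
// Hypothetical webhook config -- field names and event type are illustrative assumptions
const webhook = {
    eventTypes: ['ACTOR.RUN.SUCCEEDED'], // notify when a run finishes successfully
    requestUrl: 'https://example.com/twitter-alert' // your HTTPS endpoint to receive the event
};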
The Real Data API platform gives you programmatic access to scrapers. We have organized the Twitter Scraper API around RESTful HTTP endpoints that allow you to schedule, manage, and run Real Data API scrapers. The API also lets you track actor performance, create and update versions, access datasets, retrieve results, and more.
To use the scraper from Python, try our client package on PyPI; to use it from Node.js, try our client package on NPM.
Check out the API tab for code examples or explore Real Data API reference documents for details.
You need a Real Data API account to execute the program examples. Replace <YOUR_API_TOKEN> in the code with your API token. Read the Real Data API docs for more details on the live APIs.
import { RealdataAPIClient } from 'RealdataAPI-Client';
// Initialize the RealdataAPIClient with API token
const client = new RealdataAPIClient({
token: '<YOUR_API_TOKEN>',
});
// Prepare actor input
const input = {
"searchTerms": [
"RealdataAPI"
],
"searchMode": "live",
"profilesDesired": 10,
"tweetsDesired": 100,
"mode": "replies",
"proxyConfig": {
"useRealdataAPIProxy": true
},
"extendOutputFunction": async ({ data, item, page, request, customData, RealdataAPI }) => {
return item;
},
"extendScraperFunction": async ({ page, request, addSearch, addProfile, _, addThread, addEvent, customData, RealdataAPI, signal, label }) => {
},
"customData": {},
"handlePageTimeoutSecs": 500,
"maxRequestRetries": 6,
"maxIdleTimeoutSecs": 60
};
(async () => {
// Run the actor and wait for it to finish
const run = await client.actor("quacker/twitter-scraper").call(input);
// Fetch and print actor results from the run's dataset (if any)
console.log('Results from dataset');
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
console.dir(item);
});
})();
from RealdataAPI_client import RealdataAPIClient
# Initialize the RealdataAPIClient with your API token
client = RealdataAPIClient("<YOUR_API_TOKEN>")
# Prepare the actor input
run_input = {
"searchTerms": ["RealdataAPI"],
"searchMode": "live",
"profilesDesired": 10,
"tweetsDesired": 100,
"mode": "replies",
"proxyConfig": { "useRealdataAPIProxy": True },
"extendOutputFunction": """async ({ data, item, page, request, customData, RealdataAPI }) => {
return item;
}""",
"extendScraperFunction": """async ({ page, request, addSearch, addProfile, _, addThread, addEvent, customData, RealdataAPI, signal, label }) => {
}""",
"customData": {},
"handlePageTimeoutSecs": 500,
"maxRequestRetries": 6,
"maxIdleTimeoutSecs": 60,
}
# Run the actor and wait for it to finish
run = client.actor("quacker/twitter-scraper").call(run_input=run_input)
# Fetch and print actor results from the run's dataset (if there are any)
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
print(item)
# Set API token
API_TOKEN=<YOUR_API_TOKEN>
# Prepare actor input
cat > input.json <<'EOF'
{
"searchTerms": [
"RealdataAPI"
],
"searchMode": "live",
"profilesDesired": 10,
"tweetsDesired": 100,
"mode": "replies",
"proxyConfig": {
"useRealdataAPIProxy": true
},
"extendOutputFunction": "async ({ data, item, page, request, customData, RealdataAPI }) => {/n return item;/n}",
"extendScraperFunction": "async ({ page, request, addSearch, addProfile, _, addThread, addEvent, customData, RealdataAPI, signal, label }) => {/n /n}",
"customData": {},
"handlePageTimeoutSecs": 500,
"maxRequestRetries": 6,
"maxIdleTimeoutSecs": 60
}
EOF
# Run the actor
curl "https://api.RealdataAPI.com/v2/acts/quacker~twitter-scraper/runs?token=$API_TOKEN" /
-X POST /
-d @input.json /
-H 'Content-Type: application/json'
searchTerms
Optional Array
The scraper will discover and scrape tweets for the search terms you add before starting the run. To search hashtags, begin the search term with #; for example, to search data analytics, use #dataanalytics. Otherwise, scroll down to scrape by URL or Twitter user profile.
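For example, a searchTerms value mixing a plain query and a hashtag could look like this:
"searchTerms": ["data analytics", "#dataanalytics"]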
searchMode
Optional String
This setting changes how Twitter sorts tweets before the scraper extracts them, e.g., top tweets, latest tweets, people, photos, or videos.
Allowed values: "top", "image", "user", "video"
profilesDesired
Optional Integer
Limits the number of profiles to scrape. It helps when you want to extract a set number of tweets from selected profiles.
tweetsDesired
Optional Integer
Sets the maximum number of tweets to scrape for every search term.
addTweetViewCount
Optional Boolean
Allows you to retrieve tweet view counts.
addUserInfo
Optional Boolean
Attaches user data to each tweet. You can reduce the dataset size by turning this feature off.
useCheerio
Optional Boolean
Use Cheerio instead of Puppeteer to scrape tweets. Cheerio can scrape all tweet posts quickly.
handle
Optional Array
Feed in the specific Twitter profile handles you want to scrape. This shortcut saves you from adding the complete profile URL for each username, like https://twitter.com/username.
mode
Optional String
Choose whether to scrape only tweets or tweets together with replies. Note that this applies only when scraping Twitter profiles.
startUrls
Optional Array
Sets the start locations for the scraper. You can enter Twitter links one by one or provide a link to a single file containing multiple links.
toDate
Optional String
Extract tweets posted after a specific date, in YYYY-MM-DD format. Use it together with fromDate to create a time-bounded slice.
fromDate
Optional String
Extract tweets posted before a specific date, in YYYY-MM-DD format. Use it together with toDate to create a time-bounded slice.
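For example, assuming the field semantics described above (toDate bounds the older end of the slice, fromDate the newer end), a slice covering January 2020 could look like this; verify the pairing against your own output:
"toDate": "2020-01-01",
"fromDate": "2020-02-01"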
useAdvancedSearch
Optional Boolean
Instead of the default search, use advanced search. It helps extract tweets by user handles, search terms, or date range. Note that this option doesn't scrape retweets.
proxyConfig
Required Object
Configure the proxy servers the scraper uses, e.g., the Real Data API proxy.
extendOutputFunction
Optional String
Add or remove properties on the result object, or omit the result entirely by returning a null value.
extendScraperFunction
Optional String
An advanced function that lets you extend the default scraper functionality and perform page actions manually.
customData
Optional Object
Any data you want available inside the extend output or extend scraper functions.
handlePageTimeoutSecs
Optional Integer
The maximum timeout for handlePageFunction; increase it for lengthy processes.
maxRequestRetries
Optional Integer
Sets the maximum number of retries per request.
maxIdleTimeoutSecs
Optional Integer
Sets how long the scraper may run without receiving any data before it stops.
debugLog
Optional Boolean
Enables the debug log.
initialCookies
Optional Array
The scraper will use these login cookies to bypass the login wall. For details, check the readme tab.
browserFallback
Optional Boolean
{
"searchTerms": [
"RealdataAPI"
],
"searchMode": "live",
"profilesDesired": 10,
"tweetsDesired": 100,
"addTweetViewCount": true,
"addUserInfo": true,
"useCheerio": true,
"mode": "replies",
"startUrls": [],
"useAdvancedSearch": false,
"proxyConfig": {
"useRealdataAPIProxy": true
},
"extendOutputFunction": "async ({ data, item, page, request, customData, RealdataAPI }) => {/n return item;/n}",
"extendScraperFunction": "async ({ page, request, addSearch, addProfile, _, addThread, addEvent, customData, RealdataAPI, signal, label }) => {/n /n}",
"customData": {},
"handlePageTimeoutSecs": 500,
"maxRequestRetries": 6,
"maxIdleTimeoutSecs": 60,
"debugLog": false
}