Mining Twitter with Selenium

This is great for freaking people out. It looks like a ghost is typing in your web browser.

Web crawling with an html parser and the requests library to grab links and navigate to new pages is all very well, but when you want to physically submit search terms or login details, click buttons, and so on, the Selenium library is loads of fun – it literally automates (“drives”) a web browser in real time, and you can watch it work (and scrape the pages at the same time).


It’s pip installable:


pip install selenium

Selenium can automate a range of web browsers, but whichever one you choose you’ll also need a separate binary called a driver. I’m going to use Chrome as my web browser, so I’ll use the chromedriver binary, which I downloaded from here and copied into my /usr/local/bin directory.
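If you want to check that the driver binary is actually visible on your $PATH, something like this from the command line should do it (the version it reports will obviously depend on which release you downloaded):

which chromedriver
chromedriver --version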

Note: you also need to have the web browser itself installed… i.e. I have Chrome on my laptop. This may seem obvious, but I spent a painful few minutes trying to work out what was going on when I first ran a selenium script on a new laptop before realising what the problem was. 😳

Getting started

I’m going to start by importing a bunch of stuff. The reasons will become clear later.

 

import selenium
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, StaleElementReferenceException

from bs4 import BeautifulSoup as bs
import time

Starting a web browser

The first step is initiating the driver for the web browser. If I wanted to be fancy I would create a class to contain functions like this, but for the sake of clarity I’m just going to write individual functions and pass stuff between them here.

Here’s a function to initiate the driver for a web browser:

 

def init_driver():

     # initiate the driver:
     driver = webdriver.Chrome()

     # set a default wait time for the browser [5 seconds here]:
     driver.wait = WebDriverWait(driver, 5)

     return driver

If for some reason you don’t want to put the chromedriver binary in your $PATH you can pass its location to the call instead, e.g. driver = webdriver.Chrome("/path/to/binary/chromedriver").
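For example, a variant of init_driver that takes an optional path would look something like this (the path argument is just a placeholder – point it at wherever you keep the binary):

def init_driver(chromedriver_path=None):

	# initiate the driver, pointing at an explicit chromedriver binary if one is given:
	if chromedriver_path is not None:
		driver = webdriver.Chrome(chromedriver_path)
	else:
		driver = webdriver.Chrome()

	# set a default wait time for the browser [5 seconds here]:
	driver.wait = WebDriverWait(driver, 5)

	return driver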

Here’s a function to close the driver. We’ll use this at the end of the program.

def close_driver(driver):

	driver.close()

	return

Log in to Twitter

Once you’ve got a driver initiated then you need to do something with it. I’m going to use it to mine historic Twitter data, so the first thing I need to do is log in to Twitter.

On the Twitter login page there are two boxes, one for username and one for password. We’re going to use the selenium driver to submit input to these boxes. To do this we first need to tell selenium which box we’re interested in. This info is contained in the html of the webpage and the easiest way to find it is:

1. open the login page in your browser
2. hover the mouse over a box
3. right click on the box
4. select “inspect element” from the drop down menu

This will open the html for inspection directly in your browser and highlight the bit of the html relating to the box. It should look like this:

[screenshot: the login-page html opened in the inspector, with the username input element highlighted]

and you can see that the input class for the username box is js-username-field. Likewise, the password box is js-password-field.

 

def login_twitter(driver, username, password):

	# open the web page in the browser:
	driver.get("https://twitter.com/login")

	# find the boxes for username and password
	username_field = driver.find_element_by_class_name("js-username-field")
	password_field = driver.find_element_by_class_name("js-password-field")

	# enter your username:
	username_field.send_keys(username)
	driver.implicitly_wait(1)

	# enter your password:
	password_field.send_keys(password)
	driver.implicitly_wait(1)

	# click the "Log In" button:
	driver.find_element_by_class_name("EdgeButtom--medium").click()

	return
 

Search Twitter

Once you’re logged in you can enter a search term. I’ve defined mine here as a string called query.

To enter it we again need to find the right box in the html. The “inspect element” method will work in just the same way as before, but normally search boxes are just called "q" in the html (it’s the same for a Google search).

This time though I’m going to change the call slightly because if you’re on a slow internet connection the page might not load super fast after you hit the “Log In” button on the previous page. In this case you want to wait to make sure that the search box has loaded before you go looking for it.

Twitter search results are displayed as an infinite scrolling page rather than as a list of pages, so I’ve also included a loop that keeps scrolling down and extracting search results until they run out. To do it I used an adaptation of this code on stackoverflow to create this expected-condition class:

 

class wait_for_more_than_n_elements_to_be_present(object):
    def __init__(self, locator, count):
        self.locator = locator
        self.count = count

    def __call__(self, driver):
        try:
            elements = EC._find_elements(driver, self.locator)
            return len(elements) > self.count
        except StaleElementReferenceException:
            return False

which I call inside this function:

 

def search_twitter(driver, query):

	# wait until the search box has loaded:
	box = driver.wait.until(EC.presence_of_element_located((By.NAME, "q")))

	# clear anything that's already in the search box:
	driver.find_element_by_name("q").clear()

	# enter your search string in the search box:
	box.send_keys(query)

	# submit the query (like hitting return):
	box.submit()

	# initial wait for the search results to load
	wait = WebDriverWait(driver, 10)

	try:
		# wait until the first search result is found. Search results are tweets, which are html list items ('li') with a data-item-id attribute:
		wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "li[data-item-id]")))

		# scroll down to the last tweet until there are no more tweets:
		while True:

			# extract all the tweets:
			tweets = driver.find_elements_by_css_selector("li[data-item-id]")

			# find number of visible tweets:
			number_of_tweets = len(tweets)

			# keep scrolling:
			driver.execute_script("arguments[0].scrollIntoView();", tweets[-1])

			try:
				# wait for more tweets to be visible:
				wait.until(wait_for_more_than_n_elements_to_be_present(
					(By.CSS_SELECTOR, "li[data-item-id]"), number_of_tweets))

			except TimeoutException:
				# if no more are visible the "wait.until" call will timeout. Catch the exception and exit the while loop:
				break

		# extract the html for the whole lot:
		page_source = driver.page_source

	except TimeoutException:

		# if there are no search results at all then the first "wait.until" call will never succeed and will time out. Catch that exception and return no html:
		page_source = None

	return page_source

You can see that I’ve extracted the tweets by searching the html for list items (tag 'li') that carry a data-item-id attribute.

Read info from tweets

Once you’ve extracted the html you can pull information out of it in whatever way you like. Personally I tend to use BeautifulSoup with the lxml parser (note you’ll need to install lxml separately from BeautifulSoup if you want to use it).
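Both are pip installable:

pip install beautifulsoup4 lxml

My code to pull the tweet info out of the html looks like this: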

 

def extract_tweets(page_source):

	soup = bs(page_source,'lxml')

	tweets = []
	for li in soup.find_all("li", class_='js-stream-item'):

		# If our li doesn't have a tweet-id, we skip it as it's not going to be a tweet.
		if 'data-item-id' not in li.attrs:
			continue

		else:
			tweet = {
				'tweet_id': li['data-item-id'],
				'text': None,
				'user_id': None,
				'user_screen_name': None,
				'user_name': None,
				'created_at': None,
				'retweets': 0,
				'likes': 0,
				'replies': 0
			}

			# Tweet Text
			text_p = li.find("p", class_="tweet-text")
			if text_p is not None:
				tweet['text'] = text_p.get_text()

			# Tweet User ID, User Screen Name, User Name
			user_details_div = li.find("div", class_="tweet")
			if user_details_div is not None:
				tweet['user_id'] = user_details_div['data-user-id']
				tweet['user_screen_name'] = user_details_div['data-screen-name']
				tweet['user_name'] = user_details_div['data-name']

			# Tweet date
			date_span = li.find("span", class_="_timestamp")
			if date_span is not None:
				tweet['created_at'] = float(date_span['data-time-ms'])

			# Tweet Retweets
			retweet_span = li.select("span.ProfileTweet-action--retweet > span.ProfileTweet-actionCount")
			if retweet_span is not None and len(retweet_span) > 0:
				tweet['retweets'] = int(retweet_span[0]['data-tweet-stat-count'])

			# Tweet Likes
			like_span = li.select("span.ProfileTweet-action--favorite > span.ProfileTweet-actionCount")
			if like_span is not None and len(like_span) > 0:
				tweet['likes'] = int(like_span[0]['data-tweet-stat-count'])

			# Tweet Replies
			reply_span = li.select("span.ProfileTweet-action--reply > span.ProfileTweet-actionCount")
			if reply_span is not None and len(reply_span) > 0:
				tweet['replies'] = int(reply_span[0]['data-tweet-stat-count'])

			tweets.append(tweet)

	return tweets
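extract_tweets returns a list of dictionaries, one per tweet, so once you’ve got the page source you can loop over the results directly, e.g.:

tweets = extract_tweets(page_source)
for tweet in tweets:
	print(tweet['user_screen_name'], ':', tweet['text'])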

Putting it all together

I run all of these functions like this:

 

if __name__ == "__main__":

	# start a driver for a web browser:
	driver = init_driver()

	# log in to twitter (replace username/password with your own):
	username = "<<USERNAME>>"
	password = "<<PASSWORD>>"
	login_twitter(driver, username, password)

	# search twitter:
	query = "what ever you want to search for"
	page_source = search_twitter(driver, query)

	# extract info from the search results:
	tweets = extract_tweets(page_source)
	
	# ==============================================
	# add in any other functions here
	# maybe some analysis functions
	# maybe a function to write the info to file
	# ==============================================

	# close the driver:
	close_driver(driver)
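For the write-to-file step, here’s a minimal sketch of the kind of thing I mean (the save_tweets function name and the tweets.csv filename are just placeholders – adapt to taste). It uses the csv module from the standard library, with the dictionary keys from extract_tweets as the column headers:

import csv

def save_tweets(tweets, filename="tweets.csv"):

	# the dictionary keys used in extract_tweets become the csv column headers:
	fieldnames = ['tweet_id', 'text', 'user_id', 'user_screen_name', 'user_name',
	              'created_at', 'retweets', 'likes', 'replies']

	# write one row per tweet:
	with open(filename, 'w', newline='', encoding='utf-8') as f:
		writer = csv.DictWriter(f, fieldnames=fieldnames)
		writer.writeheader()
		writer.writerows(tweets)

	return

You could call save_tweets(tweets) in the main block just before closing the driver.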

And that’s it. Anything you can do by hand in a web browser you can also automate with selenium. It’s not particularly quick or efficient, but it is pretty cool.

9 Replies to “Mining Twitter with Selenium”

  1. Hey, this tutorial is great! Would you happen to have any idea on how I could pull all tweet permalinks within the search results? I’ve been playing around with a lot and haven’t had much success. Current setup looks like this, but obviously it can only pull the first tweet’s permalink – not all of the tweet permalinks on the page (super long spaces necessary because twitter):

    permalink_div = driver.find_element_by_xpath("""//div[@class="tweet js-stream-tweet js-actionable-tweet js-profile-popup-actionable dismissible-content
    original-tweet js-original-tweet

    "]""")
    tweet_permalink = permalink_div.get_attribute("data-permalink-path")
    print(tweet_permalink)


  2. Thanks, I’m glad you liked it 🙂

    To grab the permalink I’d probably do something like this (I’ve just adapted the part of the ‘extract_tweets’ function that grabs the user id, user name etc. from above):

    # loop over tweets:
    for li in soup.find_all("li", class_='js-stream-item'):

        # get the user details for each tweet (includes permalinks):
        user_details_div = li.find("div", class_="tweet")
        if user_details_div is not None:

            # extract the permalink:
            tweet['permalink'] = user_details_div['data-permalink-path']

    That pulls the permalink for all the tweets on the page for me. I hope it’s helpful.

    I think the issue with your code might be that you’re using “find_element_by_xpath” rather than “find_elements_by_xpath” – the extra “s” is subtle but important! Otherwise you only get a single element returned.


  3. Hey, really great and clear tutorial. However, the scrolling down does not really work for me. A browser opens, I get logged into twitter, the search term gets entered, but then it only extracts the first 20 tweets which are visible on the first page, nothing more. It doesn’t scroll down. (Except once it did, that was the only time it worked, I do not know why!). Would you happen to have an answer to this?

    Thanks in advance!


    1. Hi Tobias

      (Sorry for the slow reply – I’ve been travelling and maybe you already worked this out by now.)

      Are you also using the chromedriver binary or are you using a different one?


  4. thanks for replying, i noticed that i should stay in the page while it’s open, in this way i get all the tweets i need….
    another question: i want to get the mentions, so i add:
    tweet['mentions'] = user_details_div['data-mentions']
    because mentions are in 'data-mentions' as you know, but i get the issue: KeyError: 'data-mentions'… maybe because there are some tweets without mentions

