Mining Twitter with Selenium

This is great for freaking people out. It looks like a ghost is typing in your web browser.

Web crawling with an html parser and the requests library, grabbing links and navigating to new pages, is all very well, but when you want to physically submit search terms or login details, or click buttons (etc.), the Selenium library is loads of fun – it literally automates (“drives”) a web browser in real time, and you can watch it work (and scrape the pages at the same time).

It’s pip installable:


pip install selenium

Selenium can automate a range of web browsers, but whichever one you choose you’ll also need a separate binary called a driver. I’m going to use Chrome as my web browser, so I’ll use the chromedriver binary, which I downloaded from the ChromeDriver downloads page and copied into my /usr/local/bin directory.

Note: you also need to have the web browser itself installed… i.e. I have Chrome on my laptop. This may seem obvious, but I spent a painful few minutes trying to work out what was going on when I first ran a selenium script on a new laptop before realising what the problem was. 😳
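Before going any further it’s worth checking that selenium, the chromedriver binary and the browser can all find each other. A minimal smoke test (just a sketch) looks like this:

from selenium import webdriver

driver = webdriver.Chrome()           # raises an exception if chromedriver isn't on your $PATH
driver.get("https://www.python.org")  # load any page you like
print(driver.title)                   # prints the page title if everything is working
driver.quit()                         # shut the browser down again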

Getting started

I’m going to start by importing a bunch of stuff. The reasons will become clear later.


import selenium
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, StaleElementReferenceException

from bs4 import BeautifulSoup as bs
import time

Starting a web browser

The first step is initiating the driver for the web browser. If I wanted to be fancy I would create a class to contain functions like this, but for the sake of clarity I’m just going to write individual functions and pass stuff between them here.
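(If you did fancy the class version, a minimal sketch using the imports above might look something like this – the TwitterMiner name is just made up for illustration – but plain functions it is:)

# a sketch of the class-based alternative (not used in this post):
class TwitterMiner(object):

	def __init__(self, wait_time=5):
		# initiate the driver and attach a default wait, as in init_driver() below:
		self.driver = webdriver.Chrome()
		self.driver.wait = WebDriverWait(self.driver, wait_time)

	def close(self):
		self.driver.close()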

Here’s a function to initiate the driver for a web browser:


def init_driver():

	# initiate the driver:
	driver = webdriver.Chrome()

	# set a default wait time for the browser [5 seconds here]:
	driver.wait = WebDriverWait(driver, 5)

	return driver

If for some reason you don’t want to put the chromedriver binary in your $PATH you can specify its location in the call, e.g. driver = webdriver.Chrome("/path/to/binary/chromedriver"), instead.
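(Note: in newer releases of selenium the positional path argument has been removed; if you’re on one of those you pass a Service object instead – a sketch:)

from selenium.webdriver.chrome.service import Service

driver = webdriver.Chrome(service=Service("/path/to/binary/chromedriver"))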

Here’s a function to close the driver. We’ll use this at the end of the program.

def close_driver(driver):

	# note: close() just closes the current window; driver.quit() would end the whole browser session:
	driver.close()

	return

Log in to Twitter

Once you’ve got a driver initiated then you need to do something with it. I’m going to use it to mine historic Twitter data, so the first thing I need to do is log in to Twitter.

On the Twitter login page there are two boxes, one for username and one for password. We’re going to use the selenium driver to submit input to these boxes. To do this we first need to tell selenium which box we’re interested in. This info is contained in the html of the webpage and the easiest way to find it is:

1. open the login page in your browser
2. hover the mouse over a box
3. right click on the box
4. select “inspect element” from the drop down menu

This will open the html for inspection directly in your browser and highlight the bit of the html relating to the box. It should look like this:

[screenshot: “inspect element” view of the Twitter login page, with the html for the username input box highlighted]

and you can see that the input class for the username box is js-username-field. Likewise, the password box is js-password-field.
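Incidentally, the class name is just one of several locator strategies selenium understands; these should all find the same username box (a sketch):

# three equivalent ways to locate the username box:
driver.find_element_by_class_name("js-username-field")
driver.find_element_by_css_selector("input.js-username-field")
driver.find_element_by_xpath("//input[contains(@class, 'js-username-field')]")

Anyway, here’s the login function: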


def login_twitter(driver, username, password):

	# open the web page in the browser:
	driver.get("https://twitter.com/login")

	# find the boxes for username and password
	username_field = driver.find_element_by_class_name("js-username-field")
	password_field = driver.find_element_by_class_name("js-password-field")

	# enter your username:
	username_field.send_keys(username)
	driver.implicitly_wait(1)

	# enter your password:
	password_field.send_keys(password)
	driver.implicitly_wait(1)

	# click the "Log In" button (the "Buttom" typo is in Twitter's own html class name):
	driver.find_element_by_class_name("EdgeButtom--medium").click()

	return

Search Twitter

Once you’re logged in you can enter a search term. I’ve defined mine here as a string called query.

To enter it we again need to find the right box in the html. The “inspect element” method will work in just the same way as before, but normally search boxes are just called "q" in the html (it’s the same for a Google search).
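(As a quick illustration, the same trick drives a Google search – a sketch:)

# the Google search box is also named "q":
driver.get("https://www.google.com")
search_box = driver.find_element_by_name("q")
search_box.send_keys("selenium python")
search_box.submit()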

This time though I’m going to change the call slightly because if you’re on a slow internet connection the page might not load super fast after you hit the “Log In” button on the previous page. In this case you want to wait to make sure that the search box has loaded before you go looking for it.

Twitter search results are displayed as an infinite scrolling page, rather than a list of pages, so I’ve also included a loop to scroll down extracting all the search results until they’ve run out. To do it I used an adaptation of this code on stackoverflow to create this function/class:


class wait_for_more_than_n_elements_to_be_present(object):
    def __init__(self, locator, count):
        self.locator = locator
        self.count = count

    def __call__(self, driver):
        try:
            elements = EC._find_elements(driver, self.locator)
            return len(elements) > self.count
        except StaleElementReferenceException:
            return False
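
As an aside, you don’t strictly need a class for this: wait.until will accept any callable that takes the driver as its only argument, so the same condition can be written as a lambda (a sketch, using the public find_elements call rather than the private EC._find_elements):

wait.until(lambda d: len(d.find_elements_by_css_selector("li[data-item-id]")) > number_of_tweets)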

I call this class inside the main search function:


def search_twitter(driver, query):

	# wait until the search box has loaded:
	box = driver.wait.until(EC.presence_of_element_located((By.NAME, "q")))

	# clear anything that's already in the search box:
	driver.find_element_by_name("q").clear()

	# enter your search string in the search box:
	box.send_keys(query)

	# submit the query (like hitting return):
	box.submit()

	# initial wait for the search results to load
	wait = WebDriverWait(driver, 10)

	try:
		# wait until the first search result is found. Search results are tweets: html list items ('li') carrying a 'data-item-id' attribute:
		wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "li[data-item-id]")))

		# scroll down to the last tweet until there are no more tweets:
		while True:

			# extract all the tweets:
			tweets = driver.find_elements_by_css_selector("li[data-item-id]")

			# find number of visible tweets:
			number_of_tweets = len(tweets)

			# keep scrolling:
			driver.execute_script("arguments[0].scrollIntoView();", tweets[-1])

			try:
				# wait for more tweets to be visible:
				wait.until(wait_for_more_than_n_elements_to_be_present(
					(By.CSS_SELECTOR, "li[data-item-id]"), number_of_tweets))

			except TimeoutException:
				# if no more are visible the "wait.until" call will timeout. Catch the exception and exit the while loop:
				break

		# extract the html for the whole lot:
		page_source = driver.page_source

	except TimeoutException:

		# if there are no search results then the "wait.until" call in the first "try" statement will never happen and it will time out. So we catch that exception and return no html.
		page_source=None

	return page_source

You can see that I’ve extracted the tweets by searching on their html attributes: tweets are returned as html list items (tag 'li') carrying a 'data-item-id' attribute (it’s an attribute rather than a class, which is why the css selector is li[data-item-id]).

Read info from tweets

Once you’ve extracted the html you can extract information from it in whatever way you like. Personally I tend to use BeautifulSoup with the lxml parser (note: lxml needs installing separately from BeautifulSoup).
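Like everything else here, it’s pip installable:

pip install lxml

My extraction code looks like this: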


def extract_tweets(page_source):

	soup = bs(page_source,'lxml')

	tweets = []
	for li in soup.find_all("li", class_='js-stream-item'):

		# If our li doesn't have a tweet-id, we skip it as it's not going to be a tweet.
		if 'data-item-id' not in li.attrs:
			continue

		else:
			tweet = {
				'tweet_id': li['data-item-id'],
				'text': None,
				'user_id': None,
				'user_screen_name': None,
				'user_name': None,
				'created_at': None,
				'retweets': 0,
				'likes': 0,
				'replies': 0
			}

			# Tweet Text
			text_p = li.find("p", class_="tweet-text")
			if text_p is not None:
				tweet['text'] = text_p.get_text()

			# Tweet User ID, User Screen Name, User Name
			user_details_div = li.find("div", class_="tweet")
			if user_details_div is not None:
				tweet['user_id'] = user_details_div['data-user-id']
				tweet['user_screen_name'] = user_details_div['data-screen-name']
				tweet['user_name'] = user_details_div['data-name']

			# Tweet date
			date_span = li.find("span", class_="_timestamp")
			if date_span is not None:
				tweet['created_at'] = float(date_span['data-time-ms'])

			# Tweet Retweets
			retweet_span = li.select("span.ProfileTweet-action--retweet > span.ProfileTweet-actionCount")
			if retweet_span is not None and len(retweet_span) > 0:
				tweet['retweets'] = int(retweet_span[0]['data-tweet-stat-count'])

			# Tweet Likes
			like_span = li.select("span.ProfileTweet-action--favorite > span.ProfileTweet-actionCount")
			if like_span is not None and len(like_span) > 0:
				tweet['likes'] = int(like_span[0]['data-tweet-stat-count'])

			# Tweet Replies
			reply_span = li.select("span.ProfileTweet-action--reply > span.ProfileTweet-actionCount")
			if reply_span is not None and len(reply_span) > 0:
				tweet['replies'] = int(reply_span[0]['data-tweet-stat-count'])

			tweets.append(tweet)

	return tweets
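
I’ll leave what you do with the results up to you (that’s what the placeholder in the main block below is for), but if you just want the raw dictionaries on disk, a minimal sketch (save_tweets is just a made-up name) could dump them to json:

import json

def save_tweets(tweets, filename="tweets.json"):

	# write the list of tweet dictionaries out as json:
	with open(filename, "w") as f:
		json.dump(tweets, f, indent=2)

	return

You’d call it after extract_tweets and before close_driver.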

Putting it all together

I run all of these functions like this:


if __name__ == "__main__":

	# start a driver for a web browser:
	driver = init_driver()

	# log in to twitter (replace username/password with your own):
	username = "<<USERNAME>>"
	password = "<<PASSWORD>>"
	login_twitter(driver, username, password)

	# search twitter:
	query = "whatever you want to search for"
	page_source = search_twitter(driver, query)

	# extract info from the search results:
	tweets = extract_tweets(page_source)
	
	# ==============================================
	# add in any other functions here
	# maybe some analysis functions
	# maybe a function to write the info to file
	# ==============================================

	# close the driver:
	close_driver(driver)

And that’s it. Anything you can do by hand in a web browser you can also automate with selenium. It’s not particularly quick or efficient, but it is pretty cool.
