WebCrawling: YouTube Pagination in Python

A while ago I wrote a blog post about how to scrape videos from YouTube. One question I’ve been asked since is how to navigate between different pages of search results. So here’s how.


The preamble looks exactly the same as before:

from bs4 import BeautifulSoup as bs
import requests

base = "https://www.youtube.com/results?search_query="
qstring = "boddingtons+advert"

r = requests.get(base+qstring)

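page = r.text

# Parse the HTML so it can be searched; the later snippets use this "soup"
soup = bs(page, 'html.parser')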



Then we need to find the piece of HTML that corresponds to the pagination buttons. If you print out the “soup”, the relevant section looks like this:

<a aria-label="Go to page 2" class="yt-uix-button vve-check yt-uix-sessionlink yt-uix-button-default yt-uix-button-size-default" data-sessionlink="itct=CAkQnKQBGAciEwjDhY_x4azXAhUUjBUKHXJHBsso9CQ" data-visibility-tracking="CAkQnKQBGAciEwjDhY_x4azXAhUUjBUKHXJHBsso9CQ" href="/results?sp=SBRQFOoDAA%253D%253D&amp;search_query=boddingtons+advert"><span class="yt-uix-button-content">Next »</span></a>

To find it using BeautifulSoup we can simply specify the ‘class’ as a filter:

buttons = soup.find_all('a', attrs={'class': "yt-uix-button vve-check yt-uix-sessionlink yt-uix-button-default yt-uix-button-size-default"})

There are multiple pagination buttons on the page: one for each of pages 2 – 7 and finally “Next »”. Each one has its own URL, which you can print out like this:

for button in buttons:
    print(button['href'])

The “Next »” button is normally the one you’re looking for, and helpfully it is the last one in the list:

nextbutton = buttons[-1]
print(nextbutton['href'])
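One thing worth guarding against: if no pagination buttons are found (for example on the last page of results), buttons[-1] raises an IndexError. The sketch below runs the same lookup against a small made-up snippet rather than the live page; the SAMPLE markup, its shortened class string and the sp values are illustrative assumptions, not real YouTube output. It also shows that BeautifulSoup hands back the href with "&amp;" already decoded to "&".

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the pagination markup; class names and sp values are
# shortened placeholders, not what YouTube actually serves.
SAMPLE = """
<a class="yt-uix-button vve-check" href="/results?sp=AAA&amp;search_query=boddingtons+advert">2</a>
<a class="yt-uix-button vve-check" href="/results?sp=BBB&amp;search_query=boddingtons+advert">Next »</a>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")
buttons = soup.find_all("a", attrs={"class": "yt-uix-button vve-check"})

# Collect every pagination link, then take the last one if there is one.
hrefs = [button["href"] for button in buttons]
next_href = hrefs[-1] if hrefs else None

print(next_href)
```

When next_href comes back as None there is nothing further to follow, which is a convenient stopping condition for a crawler.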

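The href is relative (it starts with /results), so prepend https://www.youtube.com to it and then navigate to the full URL by invoking the requests.get() function once again.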
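As a rough sketch of that last step, the snippet below turns the relative href into a full URL and then repeats the fetch-and-find-next cycle for a few pages. The names absolute_url and crawl, and the fetch and find_next_href parameters, are hypothetical helpers invented for this illustration. In practice fetch could be something like lambda url: requests.get(url).text, and find_next_href would wrap the BeautifulSoup lookup described above.

```python
from urllib.parse import urljoin

YOUTUBE = "https://www.youtube.com"


def absolute_url(href, site=YOUTUBE):
    """Turn a relative href such as '/results?...' into a full URL."""
    return urljoin(site, href)


def crawl(start_url, fetch, find_next_href, max_pages=3):
    """Fetch up to max_pages pages of results, following the next link each time.

    fetch(url) should return the page HTML as a string; find_next_href(html)
    should return the relative href of the next button, or None if absent.
    Both are supplied by the caller (hypothetical hooks, not library functions).
    """
    pages = []
    url = start_url
    while url and len(pages) < max_pages:
        html = fetch(url)
        pages.append(html)
        href = find_next_href(html)
        # Stop when there is no next button to follow.
        url = urljoin(url, href) if href else None
    return pages


next_url = absolute_url("/results?sp=SBRQFOoDAA%253D%253D&search_query=boddingtons+advert")
print(next_url)
```

The max_pages cap is there so the loop cannot run away; it is also polite to pause between requests when crawling a real site.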

Then for the blog this.

