WebCrawling: YouTube Pagination in Python

A while ago I wrote a blog post about how to scrape videos from YouTube. One question I’ve been asked since is how to navigate between different pages of search results. So here’s how.

YouTube

The preamble is exactly the same as before:

from bs4 import BeautifulSoup as bs
import requests

base = "https://www.youtube.com/results?search_query="
qstring = "boddingtons+advert"

r = requests.get(base+qstring)

page = r.text
soup = bs(page, 'html.parser')
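The query string above is typed in by hand with a "+" in place of the space. As a small sketch, assuming you would rather start from a plain-text phrase, the standard library's quote_plus can do the encoding. The helper name search_url is made up for this example and is not part of the original code:

```python
from urllib.parse import quote_plus

base = "https://www.youtube.com/results?search_query="


def search_url(query):
    """Build a results URL from a plain-text search phrase (hypothetical helper)."""
    # quote_plus turns spaces into '+' and escapes characters such as '&'
    return base + quote_plus(query)


url = search_url("boddingtons advert")
print(url)
```

The resulting url can be passed to requests.get() in place of base+qstring.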

 

Pagination

Then we need to find the piece of HTML that corresponds to the pagination buttons. If you print out the soup, the relevant section looks like this:

<a aria-label="Go to page 2" class="yt-uix-button vve-check yt-uix-sessionlink yt-uix-button-default yt-uix-button-size-default" data-sessionlink="itct=CAkQnKQBGAciEwjDhY_x4azXAhUUjBUKHXJHBsso9CQ" data-visibility-tracking="CAkQnKQBGAciEwjDhY_x4azXAhUUjBUKHXJHBsso9CQ" href="/results?sp=SBRQFOoDAA%253D%253D&amp;search_query=boddingtons+advert"><span class="yt-uix-button-content">Next »</span></a>

To find it using BeautifulSoup we can simply specify the ‘class’ as a filter:

buttons = soup.find_all('a', attrs={'class': "yt-uix-button vve-check yt-uix-sessionlink yt-uix-button-default yt-uix-button-size-default"})
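Matching on the full class string only works while the class attribute stays exactly the same. As a rough alternative, assuming the pagination links keep an aria-label of the form "Go to page N" like the one in the snippet above, you can filter on that attribute instead. The sample HTML and its hrefs below are made up for illustration:

```python
import re

from bs4 import BeautifulSoup as bs

# Shortened stand-in for a results page; the hrefs are invented for this example
sample_html = '''
<a aria-label="Go to page 2" class="yt-uix-button" href="/results?page=2">2</a>
<a aria-label="Go to page 3" class="yt-uix-button" href="/results?page=3">3</a>
<a class="yt-uix-button" href="/upload">Upload</a>
'''

soup = bs(sample_html, 'html.parser')

# Keep only the links whose aria-label looks like "Go to page <number>"
page_links = soup.find_all('a', attrs={'aria-label': re.compile(r'^Go to page \d+$')})
hrefs = [a['href'] for a in page_links]
print(hrefs)
```

The "Upload" link shares the button class but has no matching aria-label, so it is left out.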

There are multiple pagination buttons on the page: one for each of pages 2 to 7 and finally “Next »”. Each one has its own URL, which you can print out like this:

for button in buttons:
    print(button['href'])

The “Next »” button is normally the one you’re looking for, and helpfully it is the last one in the list:

nextbutton = buttons[-1]
print(nextbutton['href'])

The href is relative (it starts with /results), so we prepend https://www.youtube.com to it and then navigate to the next page by calling requests.get() once again.
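Putting the steps together, here is a minimal sketch of a loop that keeps following the last button for a few pages. The function names next_page_url, fetch_html and crawl are made up for this example. It also makes two assumptions that go beyond the code above: it matches on the single class yt-uix-button rather than the full class string, and it stops when the last button's text does not contain "Next", so that the final results page does not send us backwards.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup as bs


def next_page_url(html, current_url):
    """Return the absolute URL behind the "Next" button, or None if there is none."""
    soup = bs(html, 'html.parser')
    # Assumption: every pagination button carries the class 'yt-uix-button'
    buttons = soup.find_all('a', class_='yt-uix-button')
    if not buttons:
        return None
    last = buttons[-1]
    # Assumption: the forward button is labelled "Next"
    if 'Next' not in last.get_text():
        return None
    # The href is relative ("/results?..."), so resolve it against the current URL
    return urljoin(current_url, last['href'])


def fetch_html(url):
    """Download a page with requests and return its HTML."""
    # Imported here so the parsing helper above can be used without network access
    import requests
    return requests.get(url).text


def crawl(start_url, max_pages=3, fetch=fetch_html):
    """Follow "Next" links and return the list of page URLs visited."""
    visited = []
    url = start_url
    while url is not None and len(visited) < max_pages:
        visited.append(url)
        url = next_page_url(fetch(url), url)
    return visited
```

Calling crawl(base + qstring) with the names from the preamble would visit up to three result pages. The max_pages cap is there so that the loop always terminates, and passing a different fetch function makes it possible to try the logic on saved HTML.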

