Writing a Web-Parser (or a Web-Crawler?) in Ruby

A crawler is something that is used to dig deep inside a web page. When you want to move from one page to another, then to another, you would write a web-crawler. But when you want to collect some data (or some information from the HTML), you would write a web-parser.
An easy way to remember what a web-crawler does is to think of spiders: they crawl around places, and that is exactly what a web-crawler does.

Whenever I search for videos on YouTube, I usually end up watching all the videos (at least the results on the first page). So suppose on the first page there are 10 videos in the results, and each video is, say, 10 minutes long.

Consider the fact that I have an internet connection which gives around 200 kBps download speed and is also shared by 4 people. So when I play any video, it takes at least 3 minutes for the video to buffer and then another 10 minutes (that’s how long the video is). So I end up spending 13 minutes to watch one video of 10 minutes. Now if I want to watch 10 such videos on the trot, it will take me 130 minutes to actually watch 100 minutes of content (and also waste bandwidth in the office :)). And here I have not even considered the time we all waste watching those ads before the video actually starts to stream. YouTube is picky about which videos it puts ads in, though. If you search for videos on “RubyConf India”, you would not find an ad in any of the videos. But if you search for something like “Movie Trailers”, you would find ads in probably every other video (smart move there!!). Of course, there are ad-blockers out there now, but watching videos on YouTube is still a waste of time.

So the time you actually spend watching a video is much more than the length of the video. This is the code I wrote to crawl and download YouTube videos.


  require 'rubygems' # only needed on Ruby 1.8; harmless on 1.9+
  require 'nokogiri'
  require 'open-uri'

  urls = []

  # pass the search terms here
  search_list = ["football", "manchester united", "amazing tennis", "radical something", "Just for Laugh"]

  search_list.each do |search_term|
    list_url = "http://www.youtube.com/results?search_query=#{search_term.split(' ').join('+')}"

    page = Nokogiri::HTML(open(list_url))
    p '-------------------------------------------------------------------------'
    p "finding videos for #{search_term}"

    # every search result is an <li class="yt-lockup2"> containing an <a> tag
    div_elements = page.css('li.yt-lockup2 a')

    div_elements.each do |d|
      watch_code = d.attributes['href'].value
      # keep only actual video links, skipping user and channel pages
      urls << "http://www.youtube.com" + watch_code if watch_code.include?('watch')
    end
  end

  p 'the links are'
  p urls
  p 'total videos found'
  p urls.count

  p '---------------download start -----------------------------------------------'
  urls.each do |link_to_video|
    system("youtube-dl -t #{link_to_video}")
  end
  p '---------------download completed-------------------------------------------'

If you are still on Ruby 1.8, you have to require ‘rubygems’. If you are on a higher version of Ruby (1.9 or above), that first line is not needed. Then I have required ‘nokogiri’.
Nokogiri is a Ruby gem which parses HTML/XML documents very fast. It also supports finding elements via CSS selectors and parsing via XPath.
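
As a quick illustration, here is a minimal sketch showing both lookup styles on the same document (the HTML fragment is made up for this example, not taken from YouTube):

  require 'nokogiri'

  # a made-up HTML fragment, just to show both lookup styles
  html = <<-HTML
    <ul>
      <li class="result"><a href="/watch?v=abc123">First video</a></li>
      <li class="result"><a href="/watch?v=def456">Second video</a></li>
    </ul>
  HTML

  doc = Nokogiri::HTML(html)

  # via a CSS selector
  doc.css('li.result a').each { |a| p a['href'] }

  # via the equivalent XPath expression
  doc.xpath('//li[@class="result"]/a').each { |a| p a.text }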

There are some options other than Nokogiri for web-crawling (Ruby-based, of course). There is Hpricot, although it is not maintained anymore by the creators of the gem. Then there is another gem, ‘Ox’, which is used for XML parsing. So I figured that Nokogiri is the best option in the current scenario.

I have initialized an empty array ‘urls’ to hold the URLs of the search results.
I have then passed the list of all search terms and iterated over each term to collect all the results and push them into the ‘urls’ array.

Now if you search for a video on YouTube, the generated URL has a fixed format. For instance, if you search for “ruby”, the generated URL looks like http://www.youtube.com/results?search_query=ruby,
and if you search for multiple terms, like “ruby tutorials”, the URL of the search results page looks like http://www.youtube.com/results?search_query=ruby+tutorials.
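
That is all the string interpolation in the script does; here it is in isolation. (If your search terms might contain characters other than letters and spaces, Ruby's standard URI.encode_www_form_component would be a safer way to build the query; the plain split/join below is what the script uses.)

  search_term = "ruby tutorials"

  # spaces become '+' in the query string
  list_url = "http://www.youtube.com/results?search_query=#{search_term.split(' ').join('+')}"
  p list_url # => "http://www.youtube.com/results?search_query=ruby+tutorials"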

Once I have created the search results page URL, I use Nokogiri's HTML module to open the URL and store the result in ‘page’. Now, in Nokogiri, you can parse the HTML via a CSS selector or by passing an XPath. So, what does this line mean?
div_elements = page.css('li.yt-lockup2 a')

The search results are all under a ‘ul’ tag as ‘li’ elements, and all the ‘li’ elements of interest to me have the class ‘yt-lockup2’. Under each li tag there is an ‘a’ tag whose href value is the link for the video (not the whole link, but part of it). So, the above line returns an array of all those a tags.
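
Each element in that array is a Nokogiri node, so you can read its href either through the attributes hash (as the script does) or with the shorter bracket syntax:

  div_elements = page.css('li.yt-lockup2 a')

  div_elements.each do |d|
    # both lines read the same attribute; the href is relative, e.g. "/watch?v=..."
    p d.attributes['href'].value
    p d['href']
  end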

Since the a tag did not give us the whole URL, the next step is to generate the actual URL for each video (without opening them, of course). Again, if you open any video on YouTube, the generated URL looks like http://www.youtube.com/watch?v=m10xcPcuBeg,
where ‘m10xcPcuBeg’ is the unique identifier for that particular video.

So while generating the URLs, I have also checked whether the href value contains the string ‘watch’. The reason for checking that is that when you do a search on YouTube, it also gives you some user pages and channel pages along with the actual videos. To avoid processing the user pages and channel pages, I have put in that condition. Now I have successfully collected the URLs of all the result videos.
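
To see what that check filters out, here are some hypothetical href values of the kind a results page mixes together (the exact paths are made up for illustration):

  hrefs = [
    "/watch?v=m10xcPcuBeg",  # an actual video link
    "/user/SomeUser",        # a user page
    "/channel/UCxyz123",     # a channel page
  ]

  urls = []
  hrefs.each do |watch_code|
    urls << "http://www.youtube.com" + watch_code if watch_code.include?('watch')
  end

  p urls # => ["http://www.youtube.com/watch?v=m10xcPcuBeg"]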

The next step is to download them. To download videos from YouTube, I could have used a YouTube downloader application, but then what would be the use of the above code, and I would also have to manually enter the URL every time. So to avoid all this, I have used a command-line utility, “youtube-dl”. youtube-dl is a small command-line program to download videos from YouTube.com and a few more sites. Here is the link for downloading it: http://rg3.github.io/youtube-dl/download.html
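
One small caution with that step: the URL is interpolated straight into a shell command, so escaping it is a good habit. A minimal sketch using Ruby's standard Shellwords module (the -t flag asks youtube-dl to use the video title in the file name, as in the script above):

  require 'shellwords'

  link_to_video = "http://www.youtube.com/watch?v=m10xcPcuBeg"

  # Shellwords.escape guards against shell metacharacters in the URL
  system("youtube-dl -t #{Shellwords.escape(link_to_video)}")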

I have iterated over the urls and passed each one as a parameter to youtube-dl.
So I passed in the above search terms and ended up downloading 8 GB of videos, with no more time wasted on streaming. Mission Accomplished.

But the question is: did I write a web-crawler or a web-parser? I like to think that it is a combination of both, since I am searching for videos of my interest on YouTube, then parsing the data from the HTML to create the links dynamically, and finally downloading the videos. So I like to call this a “Craw-ser” (getting the best of both worlds). But to be honest, it is more of a web-parser than a web-crawler.

That’s it!!! Here is the link to the GitHub page: https://github.com/rishijain/youtube-crawser

Enjoy ‘craw-sing’ 🙂