Forrest Blogs: How I automated the process of scraping over 3,000 links from my competitors' sites

When I first created Video Lecture Database, I painstakingly added around 450 links to its database by hand, which took about a day and was extremely tedious. In some cases, I was able to slightly automate the process by taking a page with some links and transforming its HTML into an SQL query using some fancy text editor tricks, but I had to do it differently for every site, and it was still pretty slow. There was definitely some value in collecting those links though, as many people have visited the site. But I got sick of entering links, and visitors aren't entering anything but links to porn, so my collection of links to streaming video lectures ceased to grow... until now.
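For anyone who would rather script the HTML-to-SQL step than redo it in a text editor for every site, here is a minimal Python sketch. It is not the text editor trick described above, just one way to get the same effect with the standard library. The table name `links` and its `url` and `title` columns are made up for illustration, not the real Video Lecture Database schema.

```python
import sqlite3
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (url, title) pairs from every <a href=...> in a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the link text.
            title = " ".join("".join(self._text).split())
            self.links.append((self._href, title))
            self._href = None


def extract_links(html):
    """Return a list of (url, title) tuples found in the HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def store_links(conn, links):
    """Insert links into a hypothetical `links` table, skipping known URLs."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS links (url TEXT PRIMARY KEY, title TEXT)"
    )
    # Parameterized inserts avoid hand-escaping quotes in titles.
    conn.executemany(
        "INSERT OR IGNORE INTO links (url, title) VALUES (?, ?)", links
    )
    conn.commit()
```

Calling `store_links(sqlite3.connect("links.db"), extract_links(page_html))` would load one saved page; `links.db` and `page_html` are placeholder names. Because the same parser handles any page, there is no per-site rework, although links whose text is not a useful title would still need fixing by hand.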

Armed with BatchMarklet, I decided to see if I could harvest all of the links from top competing websites, and a bunch more from other sources as well. The results are extremely promising: I gathered a collection of 3,115 links to streaming video lectures in a couple of hours. This post discusses the process of gathering those links.

To begin, I harvested all of the links on Video Lecture Database. To make this easier, I tweaked a few lines of the site's code so that it would output every link on one page, then ran BatchMarklet to get them all. This took about 10 minutes and yielded a copy of the roughly 450 links that I already had.

Next, I went to the Free Science Online Blog and used BatchMarklet to collect all of the newer posts made there since I added most of the earlier ones to Video Lecture Database. There were a few places where the links were not titled with the subject of the lecture, so BatchMarklet wouldn't work well for these. Instead, I opened each of these in a new tab, then saved them with FeedMarklet. I didn't count how many links I got here, but it took about 15 minutes to scrape all of the content that I wanted.

My next stop was the Spring 2007 MSRI lecture page. It took about 3 minutes to get all of the lectures from this page. I launched BatchMarklet, then checked all, then unchecked the non-lecture links, of which there weren't very many.

After that, I went to VideoLectures. A search for the empty string gave me a page with (no joke) everything in their database, which included every lecture, speaker, and a bunch of other stuff. I only wanted the lectures, and there were several thousand links on this page, so I didn't want to go through and check or uncheck all of the lecture or non-lecture links. Instead, I selected the section of the results that contained the lectures, pasted this into TextEdit (Mac), saved it as a WebArchive, then opened it and ran BatchMarklet on it. This gave me a list with about 1600 links to video lectures and very few non-video-lecture links. It took about 10 minutes to scan through the whole 1600-link list and uncheck a few duplicates and non-lectures.
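The duplicate check here was done by eye. As a rough sketch of how that part could be scripted, the Python below drops repeated URLs from a list of (url, title) pairs while keeping the first occurrence and the original order. The normalization rules (ignore case in the scheme and host, ignore a trailing slash and any `#fragment`) are assumptions about what counts as "the same link", not anything BatchMarklet is known to do.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Reduce a URL to a comparison key (assumed rules, see above)."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    # Lowercase scheme and host, keep the query, drop the fragment.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def dedupe_links(links):
    """Return links with duplicate URLs removed, first occurrence wins."""
    seen = set()
    unique = []
    for url, title in links:
        key = normalize(url)
        if key in seen:
            continue
        seen.add(key)
        unique.append((url, title))
    return unique
```

This only catches exact repeats of the same address. Two different URLs that point at the same lecture, and non-lecture links, would still need a manual pass.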

My next stop was lecturefox. I couldn't get all of their lectures on one page, but I could get all of them on 9 separate pages. There were very few non-lecture links on these pages, so I ran BatchMarklet on each one, checked all, then unchecked the few non-lecture links. It took about 5 minutes to extract all 296 links on this site.

I got 6 rather interesting links on physics from WBNL Streaming Video lectures, using a single batch action. There were a lot of links that I didn't want on this page, so I just checked the 6 boxes for the good ones.

From 101 science, I gathered 21 links to the multi-part series 'The Elegant Universe'. Getting the links took one batch action and about 1 minute.

Finally, I went to UC Berkeley's Webcasts, which had links to all of their courses split across 12 pages, one for each semester. About half of the links on each page were not for lectures. It took about 5 minutes to collect 244 links to course pages containing several lectures each.

In total, I gathered 2,575 links into a feed for streaming video lectures, 296 links from lecturefox, and 244 links from UC Berkeley, for a total of 3,115 links.
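As a quick sanity check on that arithmetic, the three counts can be tallied in a couple of lines of Python. The dictionary keys are just labels chosen for this example.

```python
# Link counts as reported in the post, keyed by where they ended up.
counts = {
    "streaming video lecture feed": 2575,
    "lecturefox": 296,
    "UC Berkeley": 244,
}

total = sum(counts.values())
print(f"{total:,} links")
```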

Conclusion

Links are the primary asset of link directories such as the ones mentioned above. Link directories can be profitable for their owners, but they need to have lots of good links to attract visitors. Gathering links from around the web and entering them into a database one at a time is slow and tedious. A much faster way to build a directory of links is to use BatchMarklet to streamline and automate the process of scraping those links from other sites.

Forrest Blogs: The Blog of Forrest Briggs

Wednesday, August 22, 2007
