How I automated the process of scraping over 3,000 links from my competitors' sites
When I first created Video Lecture Database, I painstakingly added around 450 links to its database by hand, which took about a day and was extremely tedious. In some cases, I was able to partly automate the process by taking a page of links and transforming its HTML into an SQL query using some fancy text editor tricks, but I had to do it differently for every site, and it was still pretty slow. There was definitely some value in collecting those links, though, as many people have visited the site. But I got sick of entering links, and visitors weren't entering anything but links to porn, so my collection of links to streaming video lectures ceased to grow... until now.
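The text editor trick boils down to pulling the URL and link text out of each anchor tag and turning every pair into an insert. Here is a minimal Python sketch of that idea. It is not the method I actually used, and the `links` table with its `url` and `title` columns is a hypothetical schema chosen for illustration.

```python
import sqlite3
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (url, title) pairs from every <a href="..."> in a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the anchor currently being read, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return a list of (url, title) pairs found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def store_links(conn, links):
    """Insert links into a hypothetical `links` table, skipping repeats."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS links (url TEXT PRIMARY KEY, title TEXT)"
    )
    # Parameterized inserts avoid the quoting problems of hand-built SQL.
    conn.executemany(
        "INSERT OR IGNORE INTO links (url, title) VALUES (?, ?)", links
    )
    conn.commit()
```

Because the parser only looks at anchor tags, the same function works on any page, which is exactly what the per-site text editor approach lacked.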
Armed with BatchMarklet, I decided to see if I could harvest all of the links from the top competing websites, and a bunch more from other sources as well. The results are extremely promising: I gathered a collection of 3,115 links to streaming video lectures in a couple of hours. This post discusses the process of gathering those links.
To begin, I harvested all of the links on Video Lecture Database. To make this easier, I tweaked a few lines of code so that it would output every link on one page, then ran BatchMarklet to get them all. This took about 10 minutes and yielded a copy of the roughly 450 links that I already had.
Next, I went to the Free Science Online Blog and used BatchMarklet to collect all of the newer posts made there since I added most of them to Video Lecture Database. There were a few places where the links were not titled with the subject of the lecture, so BatchMarklet wouldn't work well for these. Instead, I opened each of these in a new tab, then saved them with FeedMarklet. I didn't count how many links I got here, but it took about 15 minutes to scrape all of the content that I wanted.
My next stop was the Spring 2007 MSRI lecture page. It took about 3 minutes to get all of the lectures from this page. I launched BatchMarklet, then checked all, then unchecked the non-lecture links, of which there weren't very many.
After that, I went to VideoLectures. A search for the empty string gave me a page with (no joke) everything in their database, which included every lecture, every speaker, and a bunch of other stuff. I only wanted the lectures, and there were several thousand links on this page, so I didn't want to go through and check or uncheck every lecture or non-lecture link. Instead, I selected the section of the results that contained the lectures, pasted it into TextEdit (Mac), saved it as a WebArchive, then opened it and ran BatchMarklet on it. This gave me a list with about 1,600 links to video lectures and very few non-lecture links. It took about 10 minutes to scan through the whole 1,600-link list and uncheck a few duplicates and non-lectures.
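Spotting duplicates by eye in a 1,600-link list is the kind of step that could be scripted. A minimal Python sketch, assuming the links are already available as (url, title) pairs and that two URLs count as the same link when they differ only in host capitalization, a trailing slash, or a fragment. That equivalence rule is my assumption, not something BatchMarklet does.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Reduce a URL to a comparison key: lowercase scheme and host,
    no trailing slash on the path, no fragment."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def dedupe(links):
    """Keep the first (url, title) pair seen for each normalized URL."""
    seen = set()
    unique = []
    for url, title in links:
        key = normalize(url)
        if key not in seen:
            seen.add(key)
            unique.append((url, title))
    return unique
```

The first occurrence wins and the original order is preserved, so the list still reads the same way as the page it came from.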
My next stop was lecturefox. I couldn't get all of their lectures on one page, but I could get all of them on 9 separate pages. There were very few non-lecture links on these pages, so I ran BatchMarklet on each one, checked all, then unchecked the few non-lecture links. It took about 5 minutes to extract all 296 links on this site.
I got 6 rather interesting links on physics from WBNL Streaming Video lectures, using a single batch action. There were a lot of links that I didn't want on this page, so I just checked the 6 boxes for the good ones.
From 101 science, I gathered 21 links to the multi-part series 'The Elegant Universe'. Getting the links took one batch action and about 1 minute.
Finally, I went to UC Berkeley's Webcasts, which had links to all of their courses spread across 12 pages, one for each semester. About half of the links on each page were not for lectures. It took about 5 minutes to collect 244 links to course pages containing several lectures each.
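When about half of the links on a page are unwanted, a simple URL filter can do most of the unchecking. A minimal Python sketch, assuming the wanted links share a recognizable piece of their URL path. The `path_marker` argument and the example URLs below are hypothetical; I did not check what the real course URLs look like.

```python
from urllib.parse import urlsplit


def keep_matching(links, path_marker):
    """Keep only (url, title) pairs whose URL path contains path_marker."""
    return [
        (url, title)
        for url, title in links
        if path_marker in urlsplit(url).path
    ]


# Hypothetical example data, not taken from any real site.
sample = [
    ("http://example.edu/courses/physics-10", "Physics 10"),
    ("http://example.edu/about", "About"),
    ("http://example.edu/courses/history-5", "History 5"),
]
course_links = keep_matching(sample, "/courses/")
```

A filter like this only narrows the list; a quick scan of the result is still needed to catch anything the pattern misses.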
In total, I gathered 2,575 links into a feed for streaming video lectures, 296 links from lecturefox, and 244 links from UC Berkeley, for a total of 3,115 links.
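As a quick check on the arithmetic, the three counts above can be summed directly. The dictionary keys are just labels I picked for this sketch.

```python
# Link counts as reported in this post.
counts = {"feed": 2575, "lecturefox": 296, "uc_berkeley": 244}

total = sum(counts.values())
print(total)  # 3115
```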
Conclusion
Links are the primary asset of link directories such as the ones mentioned above. Link directories can be profitable for their owners, but they need to have lots of good links to attract visitors. Gathering links from around the web and entering them into a database one at a time is slow and tedious. A much faster way to build a directory of links is to use BatchMarklet to streamline and automate the process of scraping those links from other sites.