File Harvest: Targeted, Legal Crawling and Downloading of Online Media
Date
2009-06
Publisher
The Ohio State University
Abstract
Today's Internet user has limited time to manually mine the web for content such as videos, images, and documents. Much of that time is wasted on overhead: clicking hyperlinks, waiting for pages to load, and downloading the content for offline viewing. Many users would therefore benefit from an application that automatically crawls and downloads large amounts of content from the Internet, so that they can browse and further filter it offline at a much faster speed and without the unnecessary overhead. I have developed File Harvest, a web crawling and downloading program written in C# on the .NET Framework, that lets the user quickly configure the web crawling mechanism before starting it. The crawler works by following hyperlinks and examining each page it encounters along the way. The user specifies which web pages to crawl, how many levels of hyperlinks to follow, and which types of content to download. The primary insight of the work is the value of combining crawling and downloading in a single program, something that related efforts have yet to do. The program uses web page analysis techniques such as HTTP traffic proxying and static analysis of the page HTML to help the user find as much relevant content as possible to download. There are limitations on what can be found through crawling, and these limitations are the primary focus of the research going forward. In general, File Harvest can greatly expedite the discovery and downloading of media for users.
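The abstract gives enough detail to sketch the core crawl loop in C#, the program's own language. The sketch below is a reconstruction under stated assumptions, not File Harvest's actual source: all identifiers (Harvester, Crawl, Download) are hypothetical, a regular expression stands in for the static HTML analysis the abstract describes, and the HTTP-proxying technique is omitted. It shows a depth-limited crawler that follows hyperlinks and saves files whose extensions match the user's chosen content types.

```csharp
// Hypothetical reconstruction of a depth-limited crawl-and-download loop,
// based only on the behavior described in the abstract.
using System;
using System.Collections.Generic;
using System.IO;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

class Harvester
{
    // Static analysis of page HTML: pull candidate URLs out of href/src attributes.
    static readonly Regex LinkPattern =
        new Regex("(?:href|src)\\s*=\\s*[\"']([^\"'#]+)", RegexOptions.IgnoreCase);

    readonly HttpClient http = new HttpClient();
    readonly HashSet<string> visited = new HashSet<string>();
    readonly string[] extensions;   // content types to download, e.g. ".jpg", ".pdf"
    readonly string saveDir;

    public Harvester(string[] extensions, string saveDir)
    {
        this.extensions = extensions;
        this.saveDir = saveDir;
        Directory.CreateDirectory(saveDir);
    }

    // Crawl a page, download matching files, and follow links up to maxDepth levels.
    public async Task Crawl(Uri page, int maxDepth)
    {
        if (maxDepth < 0 || !visited.Add(page.AbsoluteUri)) return;

        string html;
        try { html = await http.GetStringAsync(page); }
        catch (HttpRequestException) { return; }    // skip unreachable pages

        foreach (Match m in LinkPattern.Matches(html))
        {
            if (!Uri.TryCreate(page, m.Groups[1].Value, out Uri link)) continue;
            if (link.Scheme != Uri.UriSchemeHttp &&
                link.Scheme != Uri.UriSchemeHttps) continue;

            if (Array.Exists(extensions, ext =>
                    link.AbsolutePath.EndsWith(ext, StringComparison.OrdinalIgnoreCase)))
            {
                await Download(link);               // matching media: save it
            }
            else
            {
                await Crawl(link, maxDepth - 1);    // otherwise follow the hyperlink
            }
        }
    }

    async Task Download(Uri file)
    {
        try
        {
            byte[] data = await http.GetByteArrayAsync(file);
            string name = Path.GetFileName(file.AbsolutePath);
            if (name.Length == 0) return;           // URL has no file name component
            File.WriteAllBytes(Path.Combine(saveDir, name), data);
        }
        catch (HttpRequestException) { }            // skip failed downloads
    }

    static async Task Main()
    {
        var harvester = new Harvester(new[] { ".jpg", ".png", ".pdf" }, "harvest");
        await harvester.Crawl(new Uri("https://example.com/"), maxDepth: 2);
    }
}
```

Recursion with a visited set keeps the depth limit and loop avoidance simple. The abstract indicates the real program additionally proxies HTTP traffic, presumably to catch media URLs (such as Flash video streams) that never appear in the static page HTML and would be invisible to a link-extraction pass like the one above.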
Description
1st Place - Denman Undergraduate Research Forum 2009, Engineering
Keywords
web crawling, mass downloading, online media, flash video, HTTP proxy, C#, .NET, Windows