File Harvest: Targeted, Legal Crawling and Downloading of Online Media

Date

2009-06

Publisher

The Ohio State University

Abstract

Today's internet users have limited time to manually mine the Internet for content such as videos, images, and documents. Much of that time is wasted on overhead: clicking hyperlinks, waiting for pages to load, and downloading the content for offline viewing. Many users would therefore benefit from an application that automatically crawls and downloads large amounts of content from the Internet, so that they can browse and further filter the content offline, much faster and without the unnecessary overhead. I have developed a web crawling and downloading program, File Harvest, written in C# on the .NET framework, that lets the user quickly configure the web crawling mechanism before starting it. The crawler works by following hyperlinks and examining each page it encounters along the way. The user specifies which web pages to crawl, how many levels of hyperlinks to follow, and what types of content to download. The primary insight of the work is the value of combining crawling and downloading in a single program, something that related efforts have yet to do. The program uses various web page analysis techniques, such as HTTP traffic proxying and static analysis of the page HTML, to help the user find as much relevant content as possible to download. There are some limitations as to what can be found through crawling, and these limitations are the primary focus of the research going forward. In general, File Harvest can greatly expedite the discovery and downloading of media for users.
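To make the crawl-depth and content-type configuration concrete, the following is a minimal modern C# sketch of the general approach the abstract describes, not File Harvest's actual code: the seed URL, the maxDepth and wantedExtensions parameters, and the regex-based link extraction (standing in for real HTML parsing) are all illustrative assumptions.

```csharp
// Sketch only: a depth-limited crawler that follows hyperlinks found by
// static analysis of the page HTML, and downloads files whose extensions
// match a user-supplied filter. All parameter names are hypothetical.
using System;
using System.Collections.Generic;
using System.IO;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

class CrawlerSketch
{
    static readonly HttpClient Http = new HttpClient();
    // Naive href extraction; a production crawler would use an HTML parser.
    static readonly Regex HrefPattern =
        new Regex("href\\s*=\\s*\"([^\"]+)\"", RegexOptions.IgnoreCase);

    static async Task Main()
    {
        var seed = new Uri("https://example.com/");              // user-specified start page
        var wantedExtensions = new[] { ".jpg", ".pdf", ".flv" }; // content types to download
        int maxDepth = 2;                                        // levels of hyperlinks to follow

        var visited = new HashSet<Uri>();
        var frontier = new Queue<(Uri page, int depth)>();
        frontier.Enqueue((seed, 0));

        while (frontier.Count > 0)
        {
            var (page, depth) = frontier.Dequeue();
            if (!visited.Add(page)) continue;                    // skip already-crawled pages

            string html;
            try { html = await Http.GetStringAsync(page); }
            catch (HttpRequestException) { continue; }           // skip unreachable pages

            foreach (Match m in HrefPattern.Matches(html))
            {
                // Resolve relative links against the current page's URL.
                if (!Uri.TryCreate(page, m.Groups[1].Value, out var link)) continue;
                if (link.Scheme != Uri.UriSchemeHttp &&
                    link.Scheme != Uri.UriSchemeHttps) continue;

                bool wanted = Array.Exists(wantedExtensions, ext =>
                    link.AbsolutePath.EndsWith(ext, StringComparison.OrdinalIgnoreCase));

                if (wanted)
                {
                    // Save matching content to the working directory.
                    var bytes = await Http.GetByteArrayAsync(link);
                    await File.WriteAllBytesAsync(
                        Path.GetFileName(link.AbsolutePath), bytes);
                }
                else if (depth < maxDepth)
                {
                    frontier.Enqueue((link, depth + 1));         // crawl one level deeper
                }
            }
        }
    }
}
```

The breadth-first queue with a per-page depth counter is what makes "how many levels of hyperlinks to crawl" a single user-facing setting; the HTTP-proxying analysis the abstract mentions would sit alongside this static-HTML path and is not shown here.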

Description

1st Place - Denman Undergraduate Research Forum 2009, Engineering

Keywords

web crawling, mass downloading, online media, flash video, http proxy, C# .NET Windows
