Web Crawling Project

A crawler is a program that retrieves and stores pages from the Web, commonly for a Web search engine. A crawler often has to download hundreds of millions of pages in a short period of time and has to constantly monitor and refresh the downloaded pages. In addition, the crawler should avoid putting too much load on the Web sites it visits and on its own local network, because both are shared resources.
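As an illustration of the load concern above, one common way to limit pressure on a visited site is to enforce a minimum delay between successive requests to the same host. The following is a minimal sketch of that idea, not the mechanism used in this project; the class name `PolitenessGate`, its methods, and the delay value are all hypothetical.

```python
from urllib.parse import urlsplit


class PolitenessGate:
    """Tracks the earliest time each host may be contacted again.

    Hypothetical helper for illustration; the project text does not
    describe a specific rate-limiting scheme.
    """

    def __init__(self, min_delay_seconds: float = 1.0):
        # The default delay is an arbitrary illustrative value.
        self.min_delay_seconds = min_delay_seconds
        self._next_allowed = {}  # host -> earliest permitted request time

    def wait_time(self, url: str, now: float) -> float:
        """Return the seconds to wait before fetching url (0.0 if none)."""
        host = urlsplit(url).netloc
        return max(0.0, self._next_allowed.get(host, 0.0) - now)

    def record_fetch(self, url: str, now: float) -> None:
        """Record that url was fetched at time `now`."""
        host = urlsplit(url).netloc
        self._next_allowed[host] = now + self.min_delay_seconds
```

A crawler loop would call `wait_time` before each download, sleep for the returned duration, and call `record_fetch` afterwards. Keying the delay on the host lets requests to different sites proceed without waiting on each other.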

In this project we studied how to build an effective Web crawler that can retrieve "high quality" pages quickly while keeping the retrieved pages "fresh." Toward that goal, we identified popular and reasonable definitions for the "importance" of pages and proposed simple algorithms that can identify important pages early in a crawl [11]. We also explored how to parallelize a crawling process to maximize the download rate while minimizing the overhead from parallelization [7]. In addition, we experimentally and theoretically studied how Web pages change over time and proposed an optimal page refresh policy that maximizes the "freshness" of the retrieved pages [1, 2, 3, 5, 6, 9]. Finally, we investigated automatic ways to download content from the "Hidden Web" [2], automatic ways to detect mirrors (replicated collections of pages) on the Web [8], and potential changes to the existing HTTP protocol to make the crawling process much more efficient [12].
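To make the idea of finding important pages early in a crawl more concrete, here is a minimal sketch that orders the crawl frontier by the number of links seen so far pointing to each page. Backlink count is used here only as an assumed stand-in for "importance"; the definitions and algorithms actually studied are in [11]. The function name `crawl_by_backlinks` and the `get_links` callback are hypothetical, and the example runs on an in-memory link graph rather than the real Web.

```python
import heapq
from collections import defaultdict
from itertools import count


def crawl_by_backlinks(seeds, get_links, max_pages):
    """Visit up to max_pages pages, preferring pages with more known backlinks.

    get_links(url) must return the URLs that url links to. Importance is
    approximated by the backlink count observed so far (an assumption).
    """
    backlinks = defaultdict(int)   # url -> links seen pointing at it
    tie_breaker = count()          # keeps heap order stable for equal counts
    heap = [(0, next(tie_breaker), url) for url in seeds]
    heapq.heapify(heap)
    seen = set()
    visited = []

    while heap and len(visited) < max_pages:
        _, _, url = heapq.heappop(heap)
        if url in seen:
            continue  # outdated heap entry for a page already visited
        seen.add(url)
        visited.append(url)
        for target in get_links(url):
            if target in seen:
                continue
            backlinks[target] += 1
            # Push a fresh entry; older, lower-count entries are skipped later.
            heapq.heappush(heap, (-backlinks[target], next(tie_breaker), target))
    return visited


# Toy link graph: page "d" is linked from both "a" and "b".
graph = {"a": ["b", "c", "d"], "b": ["d"], "c": [], "d": []}
order = crawl_by_backlinks(["a"], lambda u: graph.get(u, []), max_pages=10)
print(order)
```

In this toy graph the page `d` is visited before `c` because a second link to it is discovered while crawling `b`, whereas a plain breadth-first crawl would reach `c` first. The heap uses lazy deletion: instead of updating an entry in place, a new entry is pushed whenever a page's count rises, and stale entries are discarded when popped.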


