Web Crawling Project

A crawler is a program that retrieves and stores pages from the Web, typically for a Web search engine. A crawler often has to download hundreds of millions of pages in a short period of time and must constantly monitor and refresh the downloaded pages. In addition, the crawler should avoid putting too much pressure on the visited Web sites and on its own local network, because both are intrinsically shared resources.
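To make these requirements concrete, the sketch below shows the basic download loop of a crawler: URLs wait in a frontier queue, each host is contacted at most once per politeness interval, and downloaded pages are stored for later processing. It is only an illustration (the delay, page limit, and error handling are arbitrary choices, and link extraction is omitted), not code from this project.

    import time
    import urllib.request
    from collections import deque
    from urllib.parse import urlparse

    # Minimal illustrative crawler loop (not the project's crawler).
    def crawl(seed_urls, max_pages=100, per_host_delay=1.0):
        frontier = deque(seed_urls)   # URLs waiting to be downloaded
        last_fetch = {}               # host -> time of the last request to it
        pages = {}                    # URL -> downloaded content

        while frontier and len(pages) < max_pages:
            url = frontier.popleft()
            host = urlparse(url).netloc

            # Politeness: never hit the same site more often than the delay allows.
            wait = per_host_delay - (time.time() - last_fetch.get(host, 0.0))
            if wait > 0:
                time.sleep(wait)

            try:
                with urllib.request.urlopen(url, timeout=10) as response:
                    pages[url] = response.read()
            except (OSError, ValueError):
                pass                  # skip pages that fail to download
            finally:
                last_fetch[host] = time.time()

            # A full crawler would parse pages[url] here, extract its links,
            # and append the unseen ones to the frontier.

        return pages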

In this project we studied how to build an effective Web crawler that retrieves "high quality" pages quickly while keeping the retrieved pages "fresh." Toward that goal, we identified popular and reasonable definitions for the "importance" of pages and proposed simple algorithms that can identify important pages at an early stage of a crawl [11]. We also explored how to parallelize the crawling process to maximize the download rate while minimizing the overhead of parallelization [7]. In addition, we experimentally and theoretically studied how Web pages change over time and proposed an optimal page refresh policy that maximizes the "freshness" of the retrieved pages [1, 3, 4, 5, 6, 9]. Finally, we investigated automatic ways to download content from the "Hidden Web" [2], automatic ways to detect mirrors (replicated collections of pages) on the Web [8], and potential changes to the existing HTTP protocol that would make the crawling process much more efficient [12].
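As a rough illustration of the refresh-policy analysis in [3, 6]: if a page changes according to a Poisson process with rate λ and is re-downloaded every I time units, it is up-to-date for an expected fraction (1 - e^(-λI)) / (λI) of the time. The sketch below uses this formula, with made-up change rates and a made-up crawl budget, to compare a revisit schedule proportional to change frequency against a uniform schedule; the uniform schedule comes out ahead, which is the counterintuitive effect analyzed in those papers.

    import math

    def freshness(change_rate, refresh_interval):
        # Expected fraction of time a page is up to date, assuming its changes
        # follow a Poisson process with the given rate and the page is
        # re-downloaded at the given fixed interval.
        x = change_rate * refresh_interval
        return (1 - math.exp(-x)) / x

    # Two hypothetical pages: one changes 9 times a day, the other once a day,
    # and the crawler can afford 10 downloads per day in total.
    change_rates = [9.0, 1.0]

    # Proportional policy: revisit each page as often as it changes
    # (9 and 1 downloads per day).
    proportional = sum(freshness(r, 1.0 / r) for r in change_rates) / len(change_rates)

    # Uniform policy: split the 10 downloads evenly (5 per page per day).
    uniform = sum(freshness(r, 1.0 / 5.0) for r in change_rates) / len(change_rates)

    print(f"proportional refresh: average freshness = {proportional:.3f}")  # ~0.63
    print(f"uniform refresh:      average freshness = {uniform:.3f}")       # ~0.69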

Publications

  1. Ka Cheung Sia, Junghoo Cho "Efficient Monitoring Algorithm for Fast News Alert." Technical Report, UCLA Computer Science Department, June 2005.

  2. Alexandros Ntoulas, Petros Zerfos, Junghoo Cho "Downloading Textual Hidden Web Content by Keyword Queries" In Proceedings of the Joint Conference on Digital Libraries (JCDL), June 2005.

  3. Junghoo Cho, Hector Garcia-Molina "Effective page refresh policies for Web crawlers." ACM Transactions on Database Systems, 28(4): December 2003.

  4. Junghoo Cho, Hector Garcia-Molina "Estimating frequency of change." ACM Transactions on Internet Technology, 3(3): August 2003.

  5. Junghoo Cho, Alexandros Ntoulas "Effective Change Detection Using Sampling." In Proceedings of the 28th International Conference on Very Large Databases (VLDB), September 2002.

  6. Junghoo Cho, Hector Garcia-Molina "Synchronizing a database to improve freshness." In Proceedings of the 2000 ACM International Conference on Management of Data (SIGMOD), May 2000.

  7. Junghoo Cho, Hector Garcia-Molina "Parallel Crawlers." In Proceedings of the 11th World Wide Web conference (WWW11), Honolulu, Hawaii, May 2002.

  8. Junghoo Cho, Narayanan Shivakumar, Hector Garcia-Molina "Finding replicated Web collections." In Proceedings of the 2000 ACM International Conference on Management of Data (SIGMOD), May 2000.

  9. Junghoo Cho, Hector Garcia-Molina "The Evolution of the Web and Implications for an Incremental Crawler." In Proceedings of the 26th International Conference on Very Large Databases (VLDB), September 2000.

  10. Junghoo Cho, Sougata Mukherjea "Crawling Images on the Web." In Proceedings of the Third International Conference on Visual Information Systems (Visual99), Amsterdam, The Netherlands, June 1999.

  11. Junghoo Cho, Hector Garcia-Molina, Lawrence Page "Efficient Crawling Through URL Ordering." In Proceedings of the 7th World Wide Web conference (WWW7), Brisbane, Australia, April 1998.

  12. Onn Brandman, Junghoo Cho, Hector Garcia-Molina, Narayanan Shivakumar "Crawler-Friendly Web Servers." In Proceedings of the Workshop on Performance and Architecture of Web Servers (PAWS), held in conjunction with ACM SIGMETRICS 2000, Santa Clara, California, June 2000.