Web Crawling Project
A crawler is a program that retrieves and stores pages from the Web, commonly for a Web search engine. A crawler often has to download hundreds of millions of pages in a short period of time and must constantly monitor and refresh the downloaded pages. In addition, a crawler should avoid putting too much load on the visited Web sites and on its own local network, because both are intrinsically shared resources.
In this project we studied how to build an effective Web crawler that retrieves "high-quality" pages quickly while keeping the retrieved pages "fresh." Toward that goal, we identified popular and reasonable definitions of page "importance" and proposed simple algorithms that can identify important pages early in a crawl [11].
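The sketch below illustrates the general idea of importance-ordered crawling, not the specific algorithms of [11]: the URL frontier is kept in a priority queue ordered by an importance estimate, here crudely approximated by the number of in-links discovered so far. The fetch and extract_links callbacks are assumed placeholders.

```python
import heapq

def crawl_by_importance(seed_urls, fetch, extract_links, max_pages=1000):
    """Illustrative sketch: visit URLs in order of estimated importance,
    approximated here by the number of in-links discovered so far.

    `fetch(url)` and `extract_links(page)` are assumed callbacks that
    return the page content and its outgoing URLs, respectively."""
    in_links = {url: 0 for url in seed_urls}
    # heapq is a min-heap, so negative counts make the most-linked URL pop first.
    frontier = [(0, url) for url in seed_urls]
    heapq.heapify(frontier)
    visited = set()

    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue  # a URL may appear several times with outdated counts
        visited.add(url)
        page = fetch(url)
        for link in extract_links(page):
            in_links[link] = in_links.get(link, 0) + 1
            if link not in visited:
                heapq.heappush(frontier, (-in_links[link], link))
    return visited
```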
We also explored how to parallelize the crawling process so as to maximize the download rate while minimizing the overhead of parallelization [7].
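As a rough illustration of how the work can be split among parallel crawling processes (the study in [7] compares several coordination schemes; this is not its exact design), each Web site can be assigned to exactly one process by hashing its host name, so that no two processes download the same page and each site is contacted by a single process.

```python
import hashlib
from urllib.parse import urlparse

def assign_to_crawler(url, num_crawlers):
    """Map a URL to one of `num_crawlers` processes by hashing its host name.

    Hashing by host (rather than by full URL) keeps each site's pages on a
    single process, which avoids overlap and simplifies politeness control."""
    host = urlparse(url).netloc.lower()
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_crawlers

# Example: URLs discovered by any process are routed to their owner.
for u in ["http://example.com/a", "http://example.com/b", "http://example.org/"]:
    print(u, "->", assign_to_crawler(u, num_crawlers=4))
```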
In addition, we studied both experimentally and theoretically how Web pages change over time, and we proposed an optimal page refresh policy that maximizes the "freshness" of the retrieved pages [1, 3, 4, 5, 6, 9].
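Under the Poisson change model assumed in this line of work, a page that changes at rate λ and is re-downloaded every I time units is up to date with probability e^(-λt) at time t after a download, so its time-averaged freshness is (1 - e^(-λI)) / (λI). The small sketch below merely evaluates this expression; the cited papers go further and derive how refresh resources should be allocated across pages, which is not reproduced here.

```python
import math

def expected_freshness(change_rate, refresh_interval):
    """Time-averaged probability that a page is up to date, assuming its
    changes follow a Poisson process with `change_rate` changes per day
    and the crawler re-downloads it every `refresh_interval` days."""
    x = change_rate * refresh_interval
    if x == 0:
        return 1.0  # a page that never changes is always fresh
    return (1.0 - math.exp(-x)) / x

# Example: a page that changes about once a day but is refreshed only once
# a week is expected to be fresh roughly 14% of the time.
print(expected_freshness(change_rate=1.0, refresh_interval=7.0))
```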
Finally, we investigated automatic ways to download content from the "Hidden Web" [2], automatic ways to detect mirrors (replicated collections of pages) on the Web [8], and potential changes to the existing HTTP protocol that would make the crawling process much more efficient [12].
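For the Hidden-Web part, [2] studies how to choose keyword queries that pull as many documents as possible out of a site's search interface. The loop below is only a greatly simplified greedy sketch of that flavor; the search_site callback and the term-selection heuristic are assumptions, not the paper's actual policy.

```python
from collections import Counter

def harvest_hidden_site(search_site, seed_terms, max_queries=50):
    """Simplified sketch of keyword-based Hidden-Web crawling: repeatedly
    issue the most promising unused term to a site's search form and
    collect the returned documents.

    `search_site(term)` is an assumed callback that returns a list of
    (doc_id, text) pairs matching the query `term`."""
    downloaded = {}                       # doc_id -> text
    term_counts = Counter({t: 1 for t in seed_terms})
    used = set()

    for _ in range(max_queries):
        candidates = [t for t in term_counts if t not in used]
        if not candidates:
            break
        # Greedy heuristic: reuse the term seen most often in documents
        # downloaded so far, hoping it also matches many undiscovered ones.
        term = max(candidates, key=term_counts.__getitem__)
        used.add(term)
        for doc_id, text in search_site(term):
            if doc_id not in downloaded:
                downloaded[doc_id] = text
                term_counts.update(text.lower().split())
    return downloaded
```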
Publications
- Ka Cheung Sia, Junghoo Cho. "Efficient Monitoring Algorithm for Fast News Alert." Technical Report, UCLA Computer Science Department, June 2005.
- Alexandros Ntoulas, Petros Zerfos, Junghoo Cho. "Downloading Textual Hidden Web Content by Keyword Queries." In Proceedings of the Joint Conference on Digital Libraries (JCDL), June 2005.
- Junghoo Cho, Hector Garcia-Molina. "Effective Page Refresh Policies for Web Crawlers." ACM Transactions on Database Systems, 28(4), December 2003.
- Junghoo Cho, Hector Garcia-Molina. "Estimating Frequency of Change." ACM Transactions on Internet Technology, 3(3), August 2003.
- Junghoo Cho, Alexandros Ntoulas. "Effective Change Detection Using Sampling." In Proceedings of the 28th International Conference on Very Large Databases (VLDB), September 2002.
- Junghoo Cho, Hector Garcia-Molina. "Synchronizing a Database to Improve Freshness." In Proceedings of the 2000 ACM International Conference on Management of Data (SIGMOD), May 2000.
- Junghoo Cho, Hector Garcia-Molina. "Parallel Crawlers." In Proceedings of the 11th World Wide Web Conference (WWW11), Honolulu, Hawaii, May 2002.
- Junghoo Cho, Narayanan Shivakumar, Hector Garcia-Molina. "Finding Replicated Web Collections." In Proceedings of the 2000 ACM International Conference on Management of Data (SIGMOD), May 2000.
- Junghoo Cho, Hector Garcia-Molina. "The Evolution of the Web and Implications for an Incremental Crawler." In Proceedings of the 26th International Conference on Very Large Databases (VLDB), September 2000.
- Junghoo Cho, Sougata Mukherjea. "Crawling Images on the Web." In Proceedings of the Third International Conference on Visual Information Systems (Visual99), Amsterdam, The Netherlands, June 1999.
- Junghoo Cho, Hector Garcia-Molina, Lawrence Page. "Efficient Crawling Through URL Ordering." In Proceedings of the 7th World Wide Web Conference (WWW7), Brisbane, Australia, April 1998.
- Onn Brandman, Junghoo Cho, Hector Garcia-Molina, Narayanan Shivakumar. "Crawler-Friendly Web Servers." In Proceedings of the Workshop on Performance and Architecture of Web Servers (PAWS), held in conjunction with ACM SIGMETRICS 2000, Santa Clara, California, June 2000.