WebArchive Project

The goal of this project is to track and store the history of the Web. We believe that Web history data will be useful to many disciplines:

Preservation of Human Knowledge: An ever-increasing amount of information is available exclusively on the Web, and some of this information constantly disappears. How many times have we bookmarked a page, only to find later that it had disappeared? Unless we store and archive the evolving Web, humanity may continue to lose valuable knowledge that may have taken several decades to discover.
Study of Knowledge Evolution: Using Web history data, we can study how human knowledge evolves over time. For example, we may be able to study how the Linux project evolved over the last 10 years by analyzing Linux-related Web pages from that period.
Testbed for Web-related Research: Much Web-related research, including work on Web crawling and Web search, will benefit from the availability of Web history data. Researchers can test their ideas and algorithms on real Web data in the archive system.

In order to store the history of the Web, we need to address many interesting technical challenges, including the following:

Efficient Change Detection Policy: When we maintain a large number of pages that change constantly, we need to choose carefully when and how often to revisit them. Otherwise, we may waste a significant portion of our download resources revisiting unchanged pages again and again.
Compact Storage of Web Pages: When we store multiple versions of Web pages, the amount of data that we need to keep becomes extremely large. Is there a good way to compress and organize multiple versions of Web pages?
Efficient Index for History Data: To access the history data efficiently, we need to build an index over the multiple versions of Web pages. What will be the right index structure for this dataset?
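
To make the three challenges above concrete, here is a minimal Python sketch. It is not the method used in this project or in the publications listed below; it is only a naive baseline under stated assumptions. The class PageHistory, its fields, and the revisit heuristic in next_interval are all hypothetical names and choices introduced for illustration. The sketch assumes that fetches for a page arrive in increasing time order, skips storing a fetch whose content hash matches the last stored version, compresses each stored version independently with zlib, and looks up the version current at a given time by binary search over the sorted fetch times.

```python
import bisect
import hashlib
import zlib
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PageHistory:
    """Hypothetical per-URL history (illustration only)."""
    timestamps: List[float] = field(default_factory=list)  # fetch times of stored versions, ascending
    blobs: List[bytes] = field(default_factory=list)       # zlib-compressed versions, parallel to timestamps
    last_digest: str = ""                                  # SHA-256 of the most recently stored version
    visits: int = 0                                        # total number of fetches seen
    changes: int = 0                                       # fetches whose content differed from the stored one

    def record(self, fetched_at: float, content: bytes) -> bool:
        """Store the fetched content only if it differs from the last stored version.

        Assumes fetched_at is not earlier than any previous fetch time.
        Returns True if a new version was stored.
        """
        self.visits += 1
        digest = hashlib.sha256(content).hexdigest()
        if digest == self.last_digest:
            return False  # unchanged page: nothing new to store
        if self.timestamps:
            self.changes += 1  # the very first fetch does not count as a change
        self.timestamps.append(fetched_at)
        self.blobs.append(zlib.compress(content))
        self.last_digest = digest
        return True

    def version_at(self, when: float) -> Optional[bytes]:
        """Return the version that was current at time `when`, or None if none existed yet."""
        i = bisect.bisect_right(self.timestamps, when)
        if i == 0:
            return None
        return zlib.decompress(self.blobs[i - 1])

    def next_interval(self, min_interval: float, max_interval: float) -> float:
        """Naive revisit heuristic (an assumption, not a published policy).

        Pages that changed on a larger fraction of visits are revisited sooner.
        """
        change_ratio = (self.changes + 1) / (self.visits + 1)  # smoothed, always in (0, 1]
        return min(max_interval, min_interval / change_ratio)
```

A short usage example: after recording b"v1" at times 100 and 200 and b"v2" at time 300, only two versions are stored, version_at(250.0) returns b"v1", and version_at(50.0) returns None. A real archive would likely need delta encoding across versions and a disk-based index rather than in-memory lists; those design questions are exactly the open challenges listed above.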

As a testbed for this project, we are currently storing the history of the blogs available on the Web.

Publications

Ka Cheung Sia, Junghoo Cho "Efficient Monitoring Algorithm for Fast News Alert." Technical Report, UCLA Computer Science Department, June 2005.
Alexandros Ntoulas, Petros Zerfos, Junghoo Cho "Downloading Textual Hidden Web Content by Keyword Queries." In Proceedings of the Joint Conference on Digital Libraries (JCDL), June 2005.
Panagiotis G. Ipeirotis, Alexandros Ntoulas, Junghoo Cho, Luis Gravano "Modeling and Managing Content Changes in Text Databases." In Proceedings of the International Conference on Data Engineering (ICDE), March 2005.
Alexandros Ntoulas, Junghoo Cho, Christopher Olston "What's New on the Web? The Evolution of the Web from a Search Engine Perspective." In Proceedings of the World-Wide Web Conference (WWW), May 2004.
Junghoo Cho, Sourashis Roy "Impact of Web Search Engines on Page Popularity." In Proceedings of the World-Wide Web Conference (WWW), May 2004.
Junghoo Cho, Alexandros Ntoulas "Effective Change Detection Using Sampling." In Proceedings of the 28th International Conference on Very Large Databases (VLDB), September 2002.

Junghoo (John) Cho, cho@cs.ucla.edu