Alexandros Ntoulas is a Researcher at
Microsoft Research in Mountain View, California, where he is
working on integrated search of structured and unstructured data
and entity extraction.
He received a Ph.D. degree in Computer Science from the University of
California, Los Angeles (UCLA) in 2006 and
a B.Sc. from the Computer Engineering and Informatics Department (CEID) of
the University of Patras, Greece, in 2000.
His area of expertise is Databases and Web information systems and
his research interests are in the study of
systems and algorithms that facilitate the monitoring, collection, management,
mining and searching of information on the World Wide Web.
He publishes research papers in international conference
proceedings and journals and has three patents pending. He is the co-founder of
Infocious
(now Lingo Semantics), a Web
Search Engine that applies linguistic analysis techniques in order to provide
highly relevant results and a better search experience to the user.
He is the recipient of the Best Paper
Award at the ICDE 2005 conference.
The Web has brought together a wide variety of digital information into
publicly accessible media. However, because of the sheer quantity and varying
quality of available information, users often feel overwhelmed and
disoriented in their pursuit of information.
My research focuses on the monitoring, collection, sharing, mining and
searching of information in order to help users identify and extract
the desired information quickly and effectively through simple and intuitive methods.
Here is a list of the projects that I have worked on:
Hidden-Web Crawling: Search engines employ automated programs called
crawlers to download pages from the Web. Typical crawlers today follow links
from one Web page to another and download every page in their path. However, an
ever-increasing amount of information on the Web is accessible only through
search interfaces; such information is called the Hidden or Deep Web. For
example, in PubMed (www.pubmed.org) users can access pages of high-quality
papers on medical research only after issuing a set of keywords. Since there
are no static links to the Hidden-Web pages, current search engines cannot
index them, thus depriving users of access to potentially valuable
information. In my research, I studied how we can build an effective Hidden-Web
crawler that can autonomously discover and download pages from the Hidden Web.
[
JCDL 2005 |
extended version |
slides
]
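As a rough illustration of the idea of crawling through a search interface, the sketch below issues keyword queries and picks each next keyword from the pages already downloaded. It is a toy under stated assumptions, not the method from the cited paper: `DOCUMENTS`, `search` and `crawl` are hypothetical names, the "site" is an in-memory dictionary, and the greedy keyword choice (most frequent unissued term) is just one simple policy.

```python
from collections import Counter

# Hypothetical stand-in for a site reachable only through a search box.
DOCUMENTS = {
    1: "heart disease risk factors",
    2: "heart surgery outcomes",
    3: "surgery recovery diet",
    4: "diet and diabetes risk",
}

def search(keyword):
    """Simulated search interface: ids of documents containing the keyword."""
    return {doc_id for doc_id, text in DOCUMENTS.items()
            if keyword in text.split()}

def crawl(seed_keyword, max_queries=10):
    """Issue keyword queries, choosing each next keyword from downloaded pages."""
    downloaded = {}   # doc_id -> text
    issued = set()    # keywords already sent to the search interface
    keyword = seed_keyword
    while keyword is not None and len(issued) < max_queries:
        issued.add(keyword)
        for doc_id in search(keyword):
            downloaded[doc_id] = DOCUMENTS[doc_id]
        # For each unissued term, count how many downloaded documents contain it.
        counts = Counter(
            term
            for text in downloaded.values()
            for term in set(text.split())
            if term not in issued
        )
        # Greedy choice (an assumption): most frequent unissued term,
        # ties broken alphabetically; stop when no candidates remain.
        keyword = min(counts, key=lambda t: (-counts[t], t)) if counts else None
    return downloaded, issued

if __name__ == "__main__":
    pages, queries = crawl("heart")
    print(len(pages), "pages downloaded using", len(queries), "queries")
```

A real crawler would also have to weigh the cost of each query against the number of new pages it is expected to return; this sketch ignores that trade-off.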
Indexing Optimizations: Search engines typically create and maintain
large-scale indexes that are used to answer thousands of user queries per
second. Given the vast amount of information available on the Web, such indexes
can easily grow very large and become very costly to operate. In my research, I
proposed and evaluated algorithms for reducing the size of an index, without
sacrificing the quality of results that we return to the users.
[
SIGIR 2007 |
slides
]
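To make the notion of reducing index size concrete, here is a minimal sketch of one generic pruning policy: keep only the highest-frequency postings for each term. This is an illustrative assumption rather than the algorithm from the cited paper, and unlike the work described above it gives no guarantee about result quality. The names `build_index`, `prune_index` and `index_size` are hypothetical.

```python
from collections import defaultdict

def build_index(docs):
    """Build an inverted index: term -> sorted list of (doc_id, term_frequency)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return {term: sorted(postings.items()) for term, postings in index.items()}

def prune_index(index, keep):
    """Keep only the `keep` highest-frequency postings of every term."""
    pruned = {}
    for term, postings in index.items():
        # Rank by descending frequency, then ascending doc_id for determinism.
        top = sorted(postings, key=lambda p: (-p[1], p[0]))[:keep]
        pruned[term] = sorted(top)
    return pruned

def index_size(index):
    """Total number of postings stored in the index."""
    return sum(len(postings) for postings in index.values())

if __name__ == "__main__":
    docs = {1: "web web search", 2: "web crawler", 3: "web web web index"}
    full = build_index(docs)
    small = prune_index(full, keep=2)
    print(index_size(full), "postings before,", index_size(small), "after")
```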
Web Spam Detection: One of the most important goals of Web search
engines is to return highly relevant results to the users. However, given the
potential monetary value of the traffic that search engines direct to Web sites,
some Web site operators craft spam Web pages that are useless to human users
and exist for the sole purpose of manipulating search engine rankings so that
such pages are returned, in the hope of attracting traffic. At Microsoft Research,
we studied the characteristics of Web spam and proposed fast and highly
accurate algorithms for removing spam from the search engine results.
[
WWW 2006 |
slides ]
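One kind of content characteristic that can hint at spam is heavy repetition, which shows up as unusually high compressibility. The sketch below computes a compression ratio with `zlib` and flags highly repetitive text. It is a single illustrative feature with an arbitrary threshold (`threshold=4.0` is an assumption, as are the function names), not the detection algorithm referred to above.

```python
import zlib

def compression_ratio(text):
    """Uncompressed size divided by zlib-compressed size of the text."""
    data = text.encode("utf-8")
    if not data:
        return 1.0
    return len(data) / len(zlib.compress(data))

def looks_repetitive(text, threshold=4.0):
    """Flag text whose compression ratio meets a hypothetical threshold."""
    return compression_ratio(text) >= threshold

if __name__ == "__main__":
    stuffed = "cheap pills buy now " * 200
    ordinary = ("Search engines employ automated programs called crawlers "
                "to download pages from the Web.")
    print(looks_repetitive(stuffed), looks_repetitive(ordinary))
```

In practice a feature like this would be combined with many others in a trained classifier rather than used as a lone rule.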
Data Synchronization: Information on the Web is constantly updated.
Therefore, once the search engine’s crawler has downloaded pages and stored
them locally, the crawler has to refresh the pages periodically. In my research,
I performed large-scale experimental studies on several million pages
collected weekly from the Web over a period of one year. We induced models that
capture the evolution of Web sites and Web-accessible textual databases. The
models were then used to predict when we should refresh the Web pages.
Additionally, since the enormous size of the Web limits most crawlers to
downloading only a subset of the entire Web, I studied sampling-based
algorithms for determining which subset of the Web the crawler should focus on.
[
VLDB 2002 ,
slides
|
WWW 2004 ,
slides
|
ICDE 2005 ,
slides
]
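The refresh problem can be sketched as follows: from a series of periodic visits that record whether a page changed, estimate how often it changes, then revisit the fastest-changing pages first. The code assumes that changes follow a Poisson process and uses a smoothed estimator; both are assumptions made for illustration, not the models derived in the cited studies. The function names and the example URLs are hypothetical.

```python
import math

def estimate_change_rate(changed_flags, interval_days=7.0):
    """Estimate changes per day from per-visit changed/unchanged observations.

    Assumes Poisson changes and visits spaced `interval_days` apart.
    """
    n = len(changed_flags)
    if n == 0:
        raise ValueError("need at least one observation")
    x = sum(1 for changed in changed_flags if changed)
    # The +0.5 smoothing avoids log(0) when the page changed at every visit.
    unchanged_fraction = (n - x + 0.5) / (n + 0.5)
    return -math.log(unchanged_fraction) / interval_days

def refresh_order(observations):
    """Order pages by estimated change rate, fastest-changing first."""
    rates = {url: estimate_change_rate(flags)
             for url, flags in observations.items()}
    return sorted(rates, key=lambda url: -rates[url])

if __name__ == "__main__":
    weekly = {
        "example.org/news": [True, True, True, True],
        "example.org/about": [False, False, False, False],
        "example.org/blog": [True, False, True, False],
    }
    print(refresh_order(weekly))
```

Ordering strictly by estimated rate is a simplification; when crawl capacity is limited, spending all of it on the fastest-changing pages is not necessarily the best policy.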
The Infocious Web Search Engine: As part of my research, I worked on the
implementation of a full-fledged commercial Web search engine called Infocious, which combines my research in
crawling, data synchronization and indexing with a variety of natural
language processing (NLP) techniques in order to improve the quality of results
presented to the users. The search engine
performs highly efficient crawling and indexing of Web data, operates in a
distributed fashion over a cluster of commodity machines,
provides failover capabilities that guarantee 24/7 availability of the service,
and gracefully scales to the size of the Web, currently indexing more than 2
billion pages.
[
WWW 2005 |
slides
]
Automatic Web Directory Construction: Web Directories provide an
alternative to search engines for locating relevant information on the
Web. Typically, Web Directories rely on humans who put significant time and
effort into finding important pages on the Web and categorizing them in the
Directory. I studied ways to automate the creation of a Web
Directory by assigning every page from a given collection of
pages to a category in a given subject hierarchy. Our method is based on the identification
of important sequences of terms within Web pages (called lexical chains),
which are then used to assign the pages to categories in the hierarchy.
[
APWeb 2006
]
CS 143 - Database Systems, Spring Quarter 2006
Teaching Associate.
Computer Science Department, University of California Los Angeles.

CSM 152B - Digital Design Laboratory, Academic year 2001-2002
Teaching Assistant.
Computer Science Department, University of California Los Angeles.

134 - Introduction to Software Systems II (Java), Spring Semester 2001
Teaching Assistant.
Computer Engineering and Informatics Department, University of Patras.

131 - Introduction to Software Systems I (C/C++), Fall Semester 2000
Teaching Assistant.
Computer Engineering and Informatics Department, University of Patras.

131E - Software Laboratory, Fall Semester 2000
Teaching Assistant.
Computer Engineering and Informatics Department, University of Patras.

Best Paper Award, ICDE 2005
Best Paper Award in the 2005 IEEE International Conference on Data Engineering.

Gerondelis Foundation, 2005
Academic Honor (with financial award) for academic excellence.

Chorafas Foundation, 2005
Academic Honor (with financial award) for academic excellence.

Propondis Foundation, 2004
Academic Honor (with financial award) for academic excellence.

University of California Regents, Academic years 2002-2003, 2003-2004
Graduate Student Fellowship covering tuition, fees and stipend.

Technical Chamber of Greece, Academic year 2000-2001
Honorary Distinction Award for academic excellence.