Harnessing the Deep Web: Present and Future
Jayant Madhavan (Google Inc.), Loredana Afanasiev (Universiteit van Amsterdam), Lyublena Antova (Cornell University), Alon Halevy (Google)

TL;DR
This paper discusses the development of a system that exposes large volumes of Deep-Web content to search engines, compares different approaches to accessing this data, and highlights future research directions in web data integration.
Contribution
It presents a system for surfacing Deep-Web content to search engines and analyzes the strengths and weaknesses of surfacing versus virtual integration approaches.
Findings
Content exposed by the system contributes to more than 1000 search queries per second and spans over 50 languages and hundreds of domains.
It compares the surfacing and virtual integration approaches, noting the value and potential pitfalls of each.
It identifies future research areas in Deep-Web data analysis.
Abstract
Over the past few years, we have built a system that has exposed large volumes of Deep-Web content to Google.com users. The content that our system exposes contributes to more than 1000 search queries per second and spans over 50 languages and hundreds of domains. The Deep Web has long been acknowledged to be a major source of structured data on the web, and hence accessing Deep-Web content has long been a problem of interest in the data management community. In this paper, we report on where we believe the Deep Web provides value and where it does not. We contrast two very different approaches to exposing Deep-Web content -- the surfacing approach that we used, and the virtual integration approach that has often been pursued in the data management literature. We emphasize where the values of each of the two approaches lie and caution against potential pitfalls. We outline important…
Taxonomy
Topics: Web Data Mining and Analysis · Web Application Security Vulnerabilities · Advanced Malware Detection Techniques
