# Bootstrapping Domain-Specific Content Discovery on the Web

**Authors:** Kien Pham, Aécio Santos, Juliana Freire

arXiv: 1902.09667 · 2019-02-27

## TL;DR

DISCO bootstraps domain-specific web content discovery: starting from a small set of websites, it combines multiple discovery strategies in a ranking-based framework to find a large collection of relevant websites, and it outperforms state-of-the-art methods in four social-good domains.

## Contribution

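The paper introduces DISCO, a ranking-based framework that systematically combines multiple discovery strategies (keyword-based and related queries issued to search engines, plus backward and forward crawling) to discover domain-specific websites from a small initial set of sites.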

## Key findings

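- DISCO attains high harvest rates and coverage for a variety of domains.
- In experiments on four social-good domains, using data gathered by SMEs in those domains, it outperforms state-of-the-art methods.
- Systematically combining the discovery strategies, rather than relying on a single one, is what the authors credit for these results.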

## Abstract

The ability to continuously discover domain-specific content from the Web is critical for many applications. While focused crawling strategies have been shown to be effective for discovery, configuring a focused crawler is difficult and time-consuming. Given a domain of interest $D$, subject-matter experts (SMEs) must search for relevant websites and collect a set of representative Web pages to serve as training examples for creating a classifier that recognizes pages in $D$, as well as a set of pages to seed the crawl. In this paper, we propose DISCO, an approach designed to bootstrap domain-specific search. Given a small set of websites, DISCO aims to discover a large collection of relevant websites. DISCO uses a ranking-based framework that mimics the way users search for information on the Web: it iteratively discovers new pages, distills, and ranks them. It also applies multiple discovery strategies, including keyword-based and related queries issued to search engines, backward and forward crawling. By systematically combining these strategies, DISCO is able to attain high harvest rates and coverage for a variety of domains. We perform extensive experiments in four social-good domains, using data gathered by SMEs in the respective domains, and show that our approach is effective and outperforms state-of-the-art methods.
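
The iterative discover, distill, and rank loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy in-memory web (`TOY_WEB`), the Jaccard-based `score`, the round-robin choice of strategy, and the `top_k` and `threshold` parameters are all assumptions made for illustration, and the "keyword search" here only simulates a search engine by matching shared keywords.

```python
from typing import Callable, Dict, List, Set

# Hypothetical toy "web": each site has a keyword set and outgoing links.
TOY_WEB: Dict[str, Dict[str, Set[str]]] = {
    "seed.example": {"words": {"alpha", "beta", "gamma"}, "links": {"a.example", "news.example"}},
    "a.example":    {"words": {"alpha", "gamma", "delta"}, "links": {"b.example"}},
    "b.example":    {"words": {"delta", "beta", "alpha"},  "links": set()},
    "news.example": {"words": {"sports", "weather"},       "links": {"shop.example"}},
    "shop.example": {"words": {"shoes", "sale"},           "links": {"seed.example"}},
}

Strategy = Callable[[str], Set[str]]


def forward_crawl(site: str) -> Set[str]:
    """Sites that `site` links to."""
    return set(TOY_WEB[site]["links"])


def backward_crawl(site: str) -> Set[str]:
    """Sites that link to `site`."""
    return {s for s, info in TOY_WEB.items() if site in info["links"]}


def keyword_search(site: str) -> Set[str]:
    """Stand-in for a search-engine query: sites sharing any keyword with `site`."""
    words = TOY_WEB[site]["words"]
    return {s for s, info in TOY_WEB.items() if s != site and words & info["words"]}


def score(site: str, relevant: List[str]) -> float:
    """Assumed ranking function: best Jaccard similarity to any known relevant site."""
    words = TOY_WEB[site]["words"]
    best = 0.0
    for r in relevant:
        other = TOY_WEB[r]["words"]
        best = max(best, len(words & other) / len(words | other))
    return best


def discover(seeds: List[str], strategies: List[Strategy],
             iterations: int, top_k: int, threshold: float) -> List[str]:
    """Iteratively discover candidates, rank them, and promote the best ones."""
    relevant = list(seeds)
    candidates: Set[str] = set()
    for i in range(iterations):
        strategy = strategies[i % len(strategies)]  # round-robin (an assumption)
        for site in list(relevant):
            candidates |= strategy(site)            # discover
        candidates -= set(relevant)                 # distill: drop known sites
        ranked = sorted(candidates, key=lambda s: (-score(s, relevant), s))  # rank
        promoted = [s for s in ranked[:top_k] if score(s, relevant) >= threshold]
        relevant.extend(promoted)
        candidates -= set(promoted)
    return relevant


found = discover(["seed.example"],
                 [forward_crawl, backward_crawl, keyword_search],
                 iterations=3, top_k=1, threshold=0.3)
print(found)
```

Starting from the single seed, the sketch promotes the two sites whose keywords overlap with it and leaves the unrelated ones in the candidate pool. Promoted sites join the reference set used for ranking in later iterations, which is one simple way to model a bootstrap that grows from a small set of websites.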

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1902.09667/full.md

## Figures

34 figures with captions in the complete paper: https://tomesphere.com/paper/1902.09667/full.md

## References

35 references — full list in the complete paper: https://tomesphere.com/paper/1902.09667/full.md

---
Source: https://tomesphere.com/paper/1902.09667