"Piaf" vs "Adele": classifying encyclopedic queries using automatically labeled training data
Pedro Saleiro, Luís Sarmento

TL;DR
This paper presents a method for classifying web queries as encyclopedic. Training data is labeled automatically from query logs, and the resulting classifier reaches an F1 score above 87%, competing with a Google-based baseline.
Contribution
It introduces a novel approach to automatically label training data for encyclopedic query classification and identifies key features that improve classifier performance.
Findings
Achieved an F1 score above 87% in classifying encyclopedic queries.
Query projections on Wikipedia and Freebase are the most relevant features.
Using frequent positive examples improves classification results.
Abstract
Encyclopedic queries express the intent of obtaining information typically available in encyclopedias, such as biographical, geographical or historical facts. In this paper, we train a classifier for detecting the encyclopedic intent of web queries. For training such a classifier, we automatically label training data from raw query logs. We use click-through data to select positive examples of encyclopedic queries as those queries that mostly lead to Wikipedia articles. We investigate a large set of features that can be generated to describe the input query. These features include both term-specific patterns and query projections on knowledge base items (e.g. Freebase). Results show that, using these feature sets, it is possible to achieve an F1 score above 87%, competing with a Google-based baseline, which uses a much wider set of signals to boost the ranking of Wikipedia for…
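The labeling step described in the abstract (positive examples are queries whose clicks mostly lead to Wikipedia articles) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the log format of `(query, clicked_url)` pairs, the function names, and the `min_clicks` and `pos_ratio` thresholds are all assumptions made for the example, and the abstract does not say how negative examples are chosen.

```python
from collections import defaultdict
from urllib.parse import urlparse


def is_wikipedia(url):
    """Return True if the URL's host is wikipedia.org or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == "wikipedia.org" or host.endswith(".wikipedia.org")


def label_queries(click_log, min_clicks=5, pos_ratio=0.5):
    """Label queries from (query, clicked_url) pairs.

    A query is labeled True (encyclopedic) when more than `pos_ratio` of its
    clicks go to Wikipedia. Queries with fewer than `min_clicks` clicks are
    skipped. Both thresholds are hypothetical, chosen only for illustration.
    """
    total = defaultdict(int)
    wiki = defaultdict(int)
    for query, url in click_log:
        q = query.strip().lower()  # simple normalization, also an assumption
        total[q] += 1
        if is_wikipedia(url):
            wiki[q] += 1

    labels = {}
    for q, n in total.items():
        if n < min_clicks:
            continue  # too little click evidence to label this query
        labels[q] = wiki[q] / n > pos_ratio
    return labels


# Toy click log; the queries and URLs are made up for the example.
click_log = (
    [("Edith Piaf", "https://en.wikipedia.org/wiki/Edith_Piaf")] * 4
    + [("edith piaf", "https://example.com/piaf")]
    + [("adele tickets", "https://example.com/tickets")] * 5
    + [("rare query", "https://en.wikipedia.org/wiki/Rare")]
)
labels = label_queries(click_log)
```

In this toy log, "edith piaf" receives four of its five clicks on Wikipedia and is labeled positive, "adele tickets" is labeled negative, and "rare query" is dropped for lack of clicks. Treating every sufficiently frequent non-positive query as a negative is a simplification made here for brevity.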
Taxonomy
Topics: Advanced Text Analysis Techniques · Web Data Mining and Analysis · Natural Language Processing Techniques
