Corpus structure, language models, and ad hoc information retrieval
Oren Kurland, Lillian Lee

TL;DR
This paper introduces a framework that augments document language models with information drawn from clusters of similar documents, improving retrieval precision and recall across the three corpora tested.
Contribution
It proposes a novel algorithmic framework incorporating cluster-based information into language models for information retrieval.
Findings
Even the simplest of the new algorithms typically outperforms the standard language-modeling approach in precision and recall
The interpolation algorithm posts statistically significant improvements in both precision and recall
These improvements hold on all three corpora tested
Abstract
Most previous work on the recently developed language-modeling approach to information retrieval focuses on document-specific characteristics, and therefore does not take into account the structure of the surrounding corpus. We propose a novel algorithmic framework in which information provided by document-based language models is enhanced by the incorporation of information drawn from clusters of similar documents. Using this framework, we develop a suite of new algorithms. Even the simplest typically outperforms the standard language-modeling approach in precision and recall, and our new interpolation algorithm posts statistically significant improvements for both metrics over all three corpora tested.
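To make the idea of enhancing a document language model with cluster information concrete, here is a minimal sketch of one way such an interpolation could look. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the scoring formula, so the additive smoothing, the per-cluster weights, the mixing parameter `lam`, and all function names here are hypothetical choices.

```python
import math
from collections import Counter


def unigram_lm(tokens, vocab, alpha=0.5):
    """Unigram model with additive smoothing (the smoothing method is an assumption)."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def query_likelihood(query, lm):
    """Probability the model assigns to the query, treating terms as independent."""
    return math.prod(lm[w] for w in query)


def interpolated_score(query, doc_tokens, clusters, vocab, lam=0.5):
    """Mix document-level and cluster-level evidence.

    `clusters` is a list of (cluster_tokens, weight) pairs, where `weight`
    is a hypothetical measure of how strongly the document is associated
    with that cluster. `lam` = 1.0 reduces to the document-only score.
    """
    doc_part = query_likelihood(query, unigram_lm(doc_tokens, vocab))
    cluster_part = sum(
        weight * query_likelihood(query, unigram_lm(cluster_tokens, vocab))
        for cluster_tokens, weight in clusters
    )
    return lam * doc_part + (1 - lam) * cluster_part
```

With `lam` set to 1.0 the score depends on the document alone, which corresponds to the document-specific approach the abstract contrasts against. Lower values let a document benefit from query terms that appear in similar documents but not in the document itself.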
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Information Retrieval and Search Behavior
