Corpus structure, language models, and ad hoc information retrieval

Oren Kurland; Lillian Lee

arXiv:cs/0405044 · cs.IR · May 23, 2007

Open Access

TL;DR

This paper introduces a framework that augments document language models with information drawn from clusters of similar documents. Its interpolation algorithm posts statistically significant improvements in precision and recall on all three corpora tested.

Contribution

It proposes a novel algorithmic framework incorporating cluster-based information into language models for information retrieval.

Findings

01 Even the simplest of the new algorithms typically outperforms the standard language-modeling approach in precision and recall

02 The interpolation algorithm posts statistically significant improvements for both metrics

03 The improvements hold over all three corpora tested

Abstract

Most previous work on the recently developed language-modeling approach to information retrieval focuses on document-specific characteristics, and therefore does not take into account the structure of the surrounding corpus. We propose a novel algorithmic framework in which information provided by document-based language models is enhanced by the incorporation of information drawn from clusters of similar documents. Using this framework, we develop a suite of new algorithms. Even the simplest typically outperforms the standard language-modeling approach in precision and recall, and our new interpolation algorithm posts statistically significant improvements for both metrics over all three corpora tested.
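The abstract does not give the scoring formulas, so the following is only a minimal sketch of the general idea of mixing a document language model with a language model built from a cluster of similar documents. Everything in it is an assumption made for illustration: the function names, the toy corpus, the hand-assigned clusters, the Dirichlet-style smoothing, and the simple linear mix controlled by `lam`. It should not be read as the paper's actual interpolation algorithm.

```python
import math
from collections import Counter

def collection_model(docs):
    """Maximum-likelihood unigram model over the whole (toy) corpus."""
    counts = Counter(t for tokens in docs.values() for t in tokens)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def smoothed_lm(tokens, background, mu):
    """Unigram model smoothed toward the collection model (assumed smoothing scheme)."""
    counts = Counter(tokens)
    n = len(tokens)
    def prob(term):
        return (counts[term] + mu * background.get(term, 0.0)) / (n + mu)
    return prob

def query_likelihood(query, lm):
    """Probability the model assigns to the query, treating terms as independent."""
    return math.prod(lm(t) for t in query)

def interpolated_score(query, doc_id, docs, cluster_of, background, lam, mu=1.0):
    """Hypothetical mix: lam * document evidence + (1 - lam) * cluster evidence."""
    doc_lm = smoothed_lm(docs[doc_id], background, mu)
    cluster_tokens = [t for d in cluster_of[doc_id] for t in docs[d]]
    cluster_lm = smoothed_lm(cluster_tokens, background, mu)
    return (lam * query_likelihood(query, doc_lm)
            + (1.0 - lam) * query_likelihood(query, cluster_lm))

# Toy data: d1 never mentions "kitten", but its cluster-mate d2 does.
docs = {
    "d1": ["cat", "feline", "pet"],
    "d2": ["cat", "kitten", "pet"],
    "d3": ["stock", "market", "trade"],
}
cluster_of = {"d1": ["d1", "d2"], "d2": ["d1", "d2"], "d3": ["d3"]}
background = collection_model(docs)
query = ["kitten"]

def rank(lam):
    scores = {d: interpolated_score(query, d, docs, cluster_of, background, lam)
              for d in docs}
    return sorted(scores, key=scores.get, reverse=True), scores

ranking, scores = rank(lam=0.5)
doc_only_ranking, doc_only_scores = rank(lam=1.0)
```

With `lam=1.0` the score reduces to the document model alone, so `d1` and `d3` are indistinguishable for this query. With `lam=0.5`, `d1` borrows evidence from its cluster and ranks above `d3`, which is the kind of effect corpus structure is meant to provide.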

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Information Retrieval and Search Behavior