A Polya Urn Document Language Model for Improved Information Retrieval
Ronan Cummins, Jiaul Hoque Paik, and Yuanhua Lv

TL;DR
This paper introduces a Polya urn-based language model that captures term burstiness, improving information retrieval effectiveness and robustness over traditional multinomial models, with theoretical and empirical validation.
Contribution
It develops a Dirichlet compound multinomial language model using a Polya process to model word burstiness, enhancing retrieval performance and robustness.
Findings
The model significantly outperforms the state-of-the-art language model on an extensive set of TREC collections for a number of standard effectiveness metrics.
The model's tuning parameter is more robust than that of multinomial models.
The model aligns with the verbosity hypothesis and relates to tf-idf schemes.
Abstract
The multinomial language model has been one of the most effective models of retrieval for over a decade. However, the multinomial distribution does not model one important linguistic phenomenon relating to term-dependency, that is, the tendency of a term to repeat itself within a document (i.e., word burstiness). In this article, we model document generation as a random process with reinforcement (a multivariate Polya process) and develop a Dirichlet compound multinomial language model that captures word burstiness directly. We show that the new reinforced language model can be computed as efficiently as current retrieval models, and with experiments on an extensive set of TREC collections, we show that it significantly outperforms the state-of-the-art language model for a number of standard effectiveness metrics. Experiments also show that the tuning parameter in the proposed model is…
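The reinforcement idea in the abstract can be illustrated with a small simulation. This is a minimal sketch of a generic multivariate Polya urn, not the authors' implementation: the function name `sample_polya_document`, the `reinforcement` parameter, and the toy vocabulary are all assumptions made for illustration. Each time a term is drawn, its weight in the urn grows, so a term that has already appeared becomes more likely to appear again. Setting `reinforcement=0` leaves the weights fixed, which corresponds to independent multinomial draws.

```python
import random
from collections import Counter


def sample_polya_document(initial_weights, length, reinforcement=1.0, rng=None):
    """Draw `length` terms from an urn whose weights grow as terms are drawn.

    initial_weights: dict mapping term -> non-negative starting weight
                     (hypothetical toy values, not estimated from any corpus).
    reinforcement:   weight added to a term each time it is drawn;
                     0 gives plain independent (multinomial-style) sampling.
    """
    rng = rng or random.Random()
    weights = dict(initial_weights)  # copy so the caller's urn is not mutated
    terms = list(weights)
    document = []
    for _ in range(length):
        # Draw one term with probability proportional to its current weight.
        term = rng.choices(terms, weights=[weights[t] for t in terms], k=1)[0]
        document.append(term)
        # Reinforce: the drawn term becomes more likely on later draws.
        weights[term] += reinforcement
    return document


if __name__ == "__main__":
    urn = {"retrieval": 1.0, "model": 1.0, "urn": 1.0, "term": 1.0}
    rng = random.Random(0)
    bursty = sample_polya_document(urn, 20, reinforcement=1.0, rng=rng)
    flat = sample_polya_document(urn, 20, reinforcement=0.0, rng=rng)
    print("with reinforcement:   ", Counter(bursty))
    print("without reinforcement:", Counter(flat))
```

Running the script prints term counts for one reinforced and one unreinforced document; over many samples, the reinforced urn tends to produce more uneven counts, which is the burstiness effect the abstract describes.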
Taxonomy
Topics: Topic Modeling · Information Retrieval and Search Behavior · Expert finding and Q&A systems
