Compact Indexes for Flexible Top-k Retrieval
Simon Gog, Matthias Petri

TL;DR
This paper presents a self-index-based retrieval system that supports rank-safe top-k evaluation of multi-term and phrase queries. It reduces ranking time for popular IR relevance measures such as TFxIDF and BM25 by reordering the document array by document weight and by introducing the repetition array.
Contribution
It generalizes the GREEDY approach of Culpepper et al. (ESA 2010) to multi-term and phrase queries, and introduces the repetition array, which combines with the document array to give attractive functionality-space trade-offs.
Findings
Significant reduction in ranking time for TFxIDF and BM25.
Effective handling of multi-term and phrase queries.
Validated on terabyte-sized IR collections.
Abstract
We engineer a self-index based retrieval system capable of rank-safe evaluation of top-k queries. The framework generalizes the GREEDY approach of Culpepper et al. (ESA 2010) to handle multi-term queries, including over phrases. We propose two techniques which significantly reduce the ranking time for a wide range of popular Information Retrieval (IR) relevance measures, such as TFxIDF and BM25. First, we reorder elements in the document array according to document weight. Second, we introduce the repetition array, which generalizes Sadakane's (JDA 2007) document frequency structure to document subsets. Combining document and repetition array, we achieve attractive functionality-space trade-offs. We provide an extensive evaluation of our system on terabyte-sized IR collections.
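To make the role of the document array concrete, here is a minimal sketch in Python. It is not the authors' system: it uses a naive, uncompressed suffix array, scores documents by raw pattern frequency only, and implements neither the GREEDY traversal, the weight-based reordering, nor the repetition array. The names `build_index`, `suffix_range`, and `top_k_by_frequency` are hypothetical, and the sketch assumes patterns never contain the `\x00` separator.

```python
from collections import Counter

SEP = "\x00"  # assumed document separator; patterns must not contain it


def build_index(docs):
    """Build a naive suffix array and the matching document array."""
    text = ""
    doc_of_pos = []
    for doc_id, doc in enumerate(docs):
        text += doc + SEP
        doc_of_pos.extend([doc_id] * (len(doc) + 1))
    # O(n^2 log n) construction, fine for a toy example only.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # da[i] is the document that contains the suffix starting at sa[i].
    da = [doc_of_pos[i] for i in sa]
    return text, sa, da


def suffix_range(text, sa, pattern):
    """Return the half-open range [sp, ep) of suffixes prefixed by pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-char prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    sp = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-char prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sp, lo


def top_k_by_frequency(index, pattern, k):
    """Brute-force top-k: count document ids inside the pattern's range."""
    text, sa, da = index
    sp, ep = suffix_range(text, sa, pattern)
    freq = Counter(da[sp:ep])
    # Highest frequency first; ties broken by smaller document id.
    return sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


index = build_index(["abracadabra", "banana", "cabana"])
print(top_k_by_frequency(index, "a", 2))    # [(0, 5), (1, 3)]
print(top_k_by_frequency(index, "ana", 2))  # [(1, 2), (2, 1)]
```

The brute-force count scans every entry in the pattern's range, so its cost grows with the total number of occurrences. The techniques named in the abstract aim to avoid exactly that kind of exhaustive scan while keeping the result rank-safe.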
Taxonomy
Topics: Algorithms and Data Compression · Advanced Image and Video Retrieval Techniques · Data Management and Algorithms
