Faster Compact Top-k Document Retrieval
Roberto Konow, Gonzalo Navarro

TL;DR
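This paper introduces a compressed index for top-k document retrieval that uses 1.5n to 3n bytes for a collection of n symbols, against at least 80n bytes for the optimal index, and answers queries up to 25 times faster than the best previous compressed solutions. Compression comes from replacing suffix tree sampling with frequency thresholding.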
Contribution
The paper presents a compressed index for top-k document queries that replaces classical data structures with compressed ones and, as its main idea, replaces suffix tree sampling with frequency thresholding.
Findings
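The index is up to 25 times faster than the best previous compressed solutions.
Space drops from at least 80n bytes for the optimal index to 1.5n to 3n bytes on typical texts, for a collection of n symbols.
Query time is O(m + (k + log log n) log log n) for a pattern of length m, on typical texts.
In practice the index needs at most 5% more space than the best previous compressed solutions, and in some cases as little as one half.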
Abstract
An optimal index solving top-k document retrieval [Navarro and Nekrich, SODA12] takes O(m + k) time for a pattern of length m, but its space is at least 80n bytes for a collection of n symbols. We reduce it to 1.5n to 3n bytes, with O(m+(k+log log n) log log n) time, on typical texts. The index is up to 25 times faster than the best previous compressed solutions, and requires at most 5% more space in practice (and in some cases as little as one half). Apart from replacing classical by compressed data structures, our main idea is to replace suffix tree sampling by frequency thresholding to achieve compression.
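To make the query itself concrete, here is a minimal brute-force sketch of what a top-k document retrieval query returns. It is not the paper's index: it scans every document on every query, whereas the compressed index answers without scanning the collection. The relevance measure (number of occurrences of the pattern in a document) is an assumption, since the abstract does not name one, and the function names `count_occurrences` and `naive_top_k` are hypothetical.

```python
import heapq


def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, including overlapping ones."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Advance by one position so overlapping matches are counted.
        start = text.find(pattern, start + 1)
    return count


def naive_top_k(docs: list[str], pattern: str, k: int) -> list[tuple[int, int]]:
    """Return up to k (doc_id, frequency) pairs, most frequent first.

    Assumed relevance: occurrence count of the pattern in the document.
    Ties are broken by the smaller document id.
    """
    scored = []
    for doc_id, doc in enumerate(docs):
        freq = count_occurrences(doc, pattern)
        if freq > 0:  # documents without the pattern are never reported
            scored.append((freq, doc_id))
    best = heapq.nsmallest(k, scored, key=lambda t: (-t[0], t[1]))
    return [(doc_id, freq) for freq, doc_id in best]


docs = ["abracadabra", "banana", "cabana", "xyz"]
print(naive_top_k(docs, "a", 2))
print(naive_top_k(docs, "ana", 2))
```

This baseline costs time proportional to the total collection size per query, which is the cost that an index with query time depending only on m, k, and log log n avoids.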
Taxonomy
Topics: Algorithms and Data Compression · Advanced Image and Video Retrieval Techniques · DNA and Biological Computing
