Faster Compact Top-k Document Retrieval

Roberto Konow; Gonzalo Navarro

arXiv:1211.5353·cs.DS·November 26, 2012

TL;DR

This paper introduces a compressed index for top-k document retrieval that takes 1.5n to 3n bytes on typical texts, versus at least 80n bytes for the optimal index, and answers queries up to 25 times faster than the best previous compressed solutions. Its main idea is to replace suffix tree sampling with frequency thresholding.

Contribution

It presents a compressed index for top-k document queries that replaces classical data structures with compressed ones and replaces suffix tree sampling with frequency thresholding, answering queries in O(m + (k + log log n) log log n) time on typical texts.

Findings

01

The index is up to 25 times faster than the best previous compressed solutions.

02

Space is reduced from at least 80n bytes (for the optimal index) to 1.5n to 3n bytes on typical texts, for a collection of n symbols.

03

In practice the index requires at most 5% more space than the best previous compressed solutions, and in some cases as little as one half.

Abstract

An optimal index solving top-k document retrieval [Navarro and Nekrich, SODA12] takes O(m + k) time for a pattern of length m, but its space is at least 80n bytes for a collection of n symbols. We reduce it to 1.5n to 3n bytes, with O(m+(k+log log n) log log n) time, on typical texts. The index is up to 25 times faster than the best previous compressed solutions, and requires at most 5% more space in practice (and in some cases as little as one half). Apart from replacing classical by compressed data structures, our main idea is to replace suffix tree sampling by frequency thresholding to achieve compression.
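To make the problem statement concrete, here is a minimal brute-force sketch of a top-k document query. It is not the paper's index: it scans every document at query time, which is exactly the cost a precomputed index avoids. It assumes relevance is the number of occurrences of the pattern in each document (a term-frequency measure, which the abstract does not state explicitly), and the names `count_occurrences` and `naive_top_k` are hypothetical.

```python
import heapq


def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, allowing overlaps."""
    if not pattern:
        return 0
    count = 0
    pos = text.find(pattern)
    while pos != -1:
        count += 1
        # Advance by one position so overlapping matches are counted.
        pos = text.find(pattern, pos + 1)
    return count


def naive_top_k(docs: list[str], pattern: str, k: int) -> list[tuple[int, int]]:
    """Return up to k (doc_id, frequency) pairs, most frequent first.

    Assumption: relevance = pattern frequency in the document.
    Ties are broken by the smaller document id.
    """
    scored = []
    for doc_id, doc in enumerate(docs):
        freq = count_occurrences(doc, pattern)
        if freq > 0:  # documents without the pattern are never reported
            scored.append((doc_id, freq))
    # Smallest key first: highest frequency, then lowest document id.
    return heapq.nsmallest(k, scored, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    collection = ["abracadabra", "banana bandana", "cabana", "xyz"]
    print(naive_top_k(collection, "an", 2))
```

This baseline takes time proportional to the total collection size for every query, whereas the bounds quoted in the abstract depend only on the pattern length m, the number of results k, and a log log n factor.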

Taxonomy

Topics: Algorithms and Data Compression · Advanced Image and Video Retrieval Techniques · DNA and Biological Computing