# Fast, Small, and Simple Document Listing on Repetitive Text Collections

**Authors:** Dustin Cobas, Gonzalo Navarro

arXiv: 1902.07599 · 2019-02-21

## TL;DR

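This paper introduces a simple document listing index for repetitive text collections. It grammar-compresses the document array, precomputes the answers for the grammar's nonterminals, and lists the $ndoc$ distinct documents containing a pattern of length $m$ in $\mathcal{O}(m+ndoc \cdot \log n)$ time. In the authors' experiments it sharply outperforms existing alternatives in the space/time tradeoff.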

## Contribution

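The paper presents a document listing index that exploits the repetitiveness of the document array (the suffix array coarsened to document identifiers) in highly repetitive collections. The array is grammar-compressed, and the precomputed answers for nonterminals are themselves stored in grammar-compressed form.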

## Key findings

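- For a collection of total length $n$, the index lists the $ndoc$ distinct documents containing a pattern of length $m$ in $\mathcal{O}(m+ndoc \cdot \log n)$ time.
- In the authors' experiments, the index sharply outperforms existing alternatives in the space/time tradeoff on repetitive collections.
- The repetitiveness of the document array is exploited by grammar-compressing it; answers for nonterminals are precomputed and also stored in grammar-compressed form.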

## Abstract

Document listing on string collections is the task of finding all documents where a pattern appears. It is regarded as the most fundamental document retrieval problem, and is useful in various applications. Many of the fastest-growing string collections are composed of very similar documents, such as versioned code and document collections, genome repositories, etc. Plain pattern-matching indexes designed for repetitive text collections achieve orders-of-magnitude reductions in space. In contrast, there are few analogous indexes for document retrieval. In this paper we present a simple document listing index for repetitive string collections of total length $n$ that lists the $ndoc$ distinct documents where a pattern of length $m$ appears in time $\mathcal{O}(m+ndoc \cdot \log n)$. We exploit the repetitiveness of the document array (i.e., the suffix array coarsened to document identifiers) to grammar-compress it while precomputing the answers to nonterminals, and store them in grammar-compressed form as well. Our experimental results show that our index sharply outperforms existing alternatives in the space/time tradeoff map.
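
To make the notion of a document array concrete, here is a deliberately naive Python sketch. It is not the paper's index: it sorts suffixes directly, uses no grammar compression, and scans every occurrence in the matching range. The function names `build_document_array` and `list_documents`, the `\x00` separator, and the sample documents are assumptions introduced only for illustration.

```python
from bisect import bisect_left, bisect_right


def build_document_array(docs):
    """Return (text, sa, da) for a list of documents.

    sa is a naively built suffix array of the concatenated text, and
    da[i] is the id of the document containing suffix sa[i], i.e. the
    suffix array coarsened to document identifiers.
    """
    sep = "\x00"  # assumption: this character occurs in no document
    text = sep.join(docs) + sep

    # doc_of[p] = id of the document that owns text position p
    # (each separator is attributed to the document it terminates).
    doc_of = []
    for doc_id, doc in enumerate(docs):
        doc_of.extend([doc_id] * (len(doc) + 1))

    # Naive suffix sorting; fine for a toy example, far too slow for real data.
    sa = sorted(range(len(text)), key=lambda p: text[p:])
    da = [doc_of[p] for p in sa]
    return text, sa, da


def list_documents(text, sa, da, pattern):
    """List the distinct documents where `pattern` occurs.

    Assumes `pattern` is non-empty and does not contain the separator.
    """
    m = len(pattern)
    # Truncating sorted suffixes to length m keeps them in sorted order,
    # so the suffixes prefixed by `pattern` form one contiguous range.
    prefixes = [text[p:p + m] for p in sa]
    lo = bisect_left(prefixes, pattern)
    hi = bisect_right(prefixes, pattern)
    # Naive step: scan every occurrence and deduplicate document ids.
    return sorted(set(da[lo:hi]))


docs = ["abracadabra", "abracadabro", "cadabra"]
text, sa, da = build_document_array(docs)
print(list_documents(text, sa, da, "cadabra"))  # documents 0 and 2
```

The final scan over `da[lo:hi]` costs time proportional to the number of occurrences, which can be much larger than the number of distinct documents. The time bound stated in the abstract depends on $ndoc$ rather than on the occurrence count, which is the gap this toy version leaves open.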

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1902.07599/full.md

## Figures

19 figures with captions in the complete paper: https://tomesphere.com/paper/1902.07599/full.md

## References

28 references — full list in the complete paper: https://tomesphere.com/paper/1902.07599/full.md

---
Source: https://tomesphere.com/paper/1902.07599