Improved Compressed String Dictionaries

Nieves R. Brisaboa; Ana Cerdeira-Pena; Guillermo de Bernardo; Gonzalo Navarro

arXiv:1911.08372·cs.DS·November 20, 2019

Open Access

TL;DR

This paper presents a new family of compressed data structures that efficiently store and query large string dictionaries, achieving better compression and competitive query times for URL collections and RDF datasets.

Contribution

The paper combines hierarchical Front-coding with ideas from longest-common-prefix computation in suffix arrays to improve the compression of string dictionaries.

Findings

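01. Achieves better compression than the state-of-the-art alternatives in the authors' experiments.

02. Provides very competitive query times on large, real-world dictionaries held in main memory.

03. Targets two domains: URL collections (used in Web graphs and Web mining) and the URIs and literals of RDF datasets.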

Abstract

We introduce a new family of compressed data structures to efficiently store and query large string dictionaries in main memory. Our main technique is a combination of hierarchical Front-coding with ideas from longest-common-prefix computation in suffix arrays. Our data structures yield relevant space-time tradeoffs in real-world dictionaries. We focus on two domains where string dictionaries are extensively used and efficient compression is required: URL collections, a key element in Web graphs and applications such as Web mining; and collections of URIs and literals, the basic components of RDF datasets. Our experiments show that our data structures achieve better compression than the state-of-the-art alternatives while providing very competitive query times.
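
For intuition, the sketch below shows plain (non-hierarchical) Front-coding, the classical technique that the abstract builds on. It is not the paper's data structure: the hierarchical layout and the suffix-array-inspired longest-common-prefix ideas are not reproduced here. The function names (`lcp_len`, `front_encode`, `extract`, `locate`) and the `bucket_size` parameter are assumptions made for this illustration.

```python
from bisect import bisect_right


def lcp_len(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i


def front_encode(sorted_strings, bucket_size=4):
    """Plain Front-coding over a lexicographically sorted list.

    Each bucket stores its first string in full (the header) and every
    other string as (LCP length with its predecessor, remaining suffix).
    """
    buckets = []
    for start in range(0, len(sorted_strings), bucket_size):
        chunk = sorted_strings[start:start + bucket_size]
        header = chunk[0]
        rest = []
        prev = header
        for s in chunk[1:]:
            k = lcp_len(prev, s)
            rest.append((k, s[k:]))
            prev = s
        buckets.append((header, rest))
    return buckets


def extract(buckets, i, bucket_size=4):
    """Return the i-th string (0-based) by decoding inside its bucket."""
    header, rest = buckets[i // bucket_size]
    s = header
    for k, suffix in rest[: i % bucket_size]:
        s = s[:k] + suffix
    return s


def locate(buckets, target, bucket_size=4):
    """Return the 0-based index of target, or -1 if it is absent."""
    # Rebuilt on every call for brevity; a real structure would keep
    # the headers in a separate searchable array.
    headers = [h for h, _ in buckets]
    b = bisect_right(headers, target) - 1  # last bucket with header <= target
    if b < 0:
        return -1
    header, rest = buckets[b]
    s = header
    if s == target:
        return b * bucket_size
    for j, (k, suffix) in enumerate(rest, start=1):
        s = s[:k] + suffix
        if s == target:
            return b * bucket_size + j
    return -1
```

With sorted URLs that share long prefixes, most entries shrink to a small integer plus a short suffix, which is why Front-coding suits URL and URI dictionaries. A larger `bucket_size` stores fewer full headers but makes each `extract` or `locate` decode more entries, a simple instance of the space-time tradeoff the abstract refers to.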

Taxonomy

Topics: Algorithms and Data Compression · Natural Language Processing Techniques · Network Packet Processing and Optimization