Handling Massive N-Gram Datasets Efficiently

Giulio Ermanno Pibiri; Rossano Venturini

arXiv:1806.09447·cs.IR·February 8, 2022

TL;DR

This paper introduces space-efficient data structures for indexing large n-gram datasets and a faster algorithm for estimating modified Kneser-Ney language models, reducing both index size and estimation time.

Contribution

The paper presents a novel compressed trie for n-gram indexing and an improved estimation algorithm that requires only one external sorting step.

Findings

01 Achieves high space reduction with negligible query time penalty.

02 Reduces estimation time by an average of 4.5× on billions of n-grams.

03 Provides a more efficient method for large-scale language model training.

Abstract

This paper deals with the two fundamental problems concerning the handling of large n-gram language models: indexing, that is, compressing the n-gram strings and associated satellite data without compromising their retrieval speed; and estimation, that is, computing the probability distribution of the strings from a large textual source. Regarding the problem of indexing, we describe compressed, exact and lossless data structures that achieve, at the same time, high space reductions and no time degradation with respect to state-of-the-art solutions and related software packages. In particular, we present a compressed trie data structure in which each word following a context of fixed length k, i.e., its preceding k words, is encoded as an integer whose value is proportional to the number of words that follow such context. Since the number of words following a given context is typically…
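
The context-based encoding described in the abstract can be illustrated with a minimal sketch. It assumes words are already mapped to integer vocabulary IDs and replaces each word by its rank among the distinct words that follow the same context of length k, so the code is bounded by the number of successors of that context rather than by the vocabulary size. The function names (`build_successor_table`, `encode`, `decode`) and the toy IDs are hypothetical, and plain Python dictionaries and lists stand in for the paper's compressed trie; this is not the tongrams API.

```python
from collections import defaultdict


def build_successor_table(ngrams, k):
    """For every context of k word IDs, collect the sorted list of
    distinct word IDs that follow it in the given n-grams."""
    successors = defaultdict(set)
    for gram in ngrams:
        # The context is the k IDs immediately preceding the last word.
        context, word = tuple(gram[:-1][-k:]), gram[-1]
        successors[context].add(word)
    return {ctx: sorted(words) for ctx, words in successors.items()}


def encode(table, context, word):
    """Replace a vocabulary ID by its rank among the successors of context."""
    return table[tuple(context)].index(word)


def decode(table, context, code):
    """Recover the original vocabulary ID from the context-local rank."""
    return table[tuple(context)][code]


# Toy bigrams (k = 1) over hypothetical vocabulary IDs.
bigrams = [(1000, 52341), (1000, 7), (1000, 99999), (2000, 52341)]
table = build_successor_table(bigrams, k=1)

code = encode(table, (1000,), 52341)     # rank within [7, 52341, 99999]
original = decode(table, (1000,), code)  # back to the vocabulary ID
```

In this toy input, the word with ID 52341 is encoded as 1 after context 1000 and as 0 after context 2000: the same word receives different small codes depending on its context, and the original ID is recovered by looking the code up in that context's successor list.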

Code & Models

Repositories

jermp/tongrams (official)
