Fast, Small and Exact: Infinite-order Language Modelling with Compressed Suffix Trees

Ehsan Shareghi; Matthias Petri; Gholamreza Haffari; Trevor Cohn

arXiv:1608.04465 · cs.CL · August 17, 2016

TL;DR

This paper introduces a language model based on compressed suffix trees that is compact enough to hold in memory and supports exact probability queries computed on-the-fly. For large corpora and high Markov orders it is competitive with KenLM in speed while using far less memory.

Contribution

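The authors develop a language modelling approach based on compressed suffix trees, together with several optimisations that improve query runtimes by up to 2500x at the cost of a modest increase in construction time and memory usage. Memory requirements are much lower than those of the state-of-the-art KenLM package.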

Findings

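01

Optimisations improve query runtimes by up to 2500x, with only a modest increase in construction time and memory usage

02

Memory requirements are much lower than KenLM's, often by orders of magnitude

03

For large corpora and high Markov orders, runtimes are similar to KenLM's for training and comparable for querying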

Abstract

Efficient methods for storing and querying are critical for scaling high-order n-gram language models to large corpora. We propose a language model based on compressed suffix trees, a representation that is highly compact and can be easily held in memory, while supporting queries needed in computing language model probabilities on-the-fly. We present several optimisations which improve query runtimes up to 2500x, despite only incurring a modest increase in construction time and memory usage. For large corpora and high Markov orders, our method is highly competitive with the state-of-the-art KenLM package. It imposes much lower memory requirements, often by orders of magnitude, and has runtimes that are either similar (for training) or comparable (for querying).
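
As a rough illustration of what computing probabilities "on-the-fly" from a suffix structure can look like, the sketch below counts n-grams of any length by binary search over a suffix array and turns those counts into a conditional probability. It is not the authors' implementation. It uses a plain, uncompressed suffix array as a stand-in for a compressed suffix tree, it estimates an unsmoothed relative frequency with a naive back-off to shorter contexts rather than a smoothed model, and the function names (`build_suffix_array`, `count`, `cond_prob`) are hypothetical.

```python
from bisect import bisect_left, bisect_right


def build_suffix_array(tokens):
    """Return the start positions of all suffixes in lexicographic order.

    Naive construction that copies every suffix; suitable for a toy corpus only.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def count(tokens, sa, pattern):
    """Count occurrences of `pattern` (a list of tokens) in the corpus."""
    m = len(pattern)
    if m == 0:
        return len(tokens)
    # Truncating sorted suffixes to their first m tokens keeps them sorted,
    # so all matches form one contiguous run that binary search can locate.
    # The prefixes are materialised here for clarity, not for efficiency.
    prefixes = [tokens[i:i + m] for i in sa]
    return bisect_right(prefixes, pattern) - bisect_left(prefixes, pattern)


def cond_prob(tokens, sa, context, word):
    """Unsmoothed estimate of P(word | context).

    Assumption for illustration: if the context was never seen, drop its
    oldest word and retry, down to the empty context.
    """
    context = list(context)
    while context and count(tokens, sa, context) == 0:
        context = context[1:]
    denom = count(tokens, sa, context)
    if denom == 0:
        return 0.0
    return count(tokens, sa, context + [word]) / denom


if __name__ == "__main__":
    corpus = "the cat sat on the mat the cat ran".split()
    sa = build_suffix_array(corpus)
    print(count(corpus, sa, ["the", "cat"]))  # 2
    print(cond_prob(corpus, sa, ["on", "the"], "mat"))
```

Because the counts come from the index at query time, nothing here fixes a maximum n-gram order in advance; the context can be as long as the query supplies, which is the sense in which such a model can be called infinite-order.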

Code & Models

Repositories

eehsan/cstlm (Official)