TL;DR
This paper introduces a language model built on compressed suffix trees that is compact enough to hold in memory and computes exact n-gram probabilities on the fly. For large corpora and high Markov orders it is competitive with KenLM on runtime while needing far less memory.
Contribution
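The authors build a language model on compressed suffix trees and present several optimisations that improve its query runtimes by up to 2500x, at a modest cost in construction time and memory usage. Compared with the state-of-the-art KenLM package, the model requires much less memory.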
Findings
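Optimisations improve query runtimes by up to 2500x, with only a modest increase in construction time and memory usage
Memory requirements are much lower than KenLM's, often by orders of magnitude
Runtimes are similar to KenLM's for training and comparable for querying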
Abstract
Efficient methods for storing and querying are critical for scaling high-order n-gram language models to large corpora. We propose a language model based on compressed suffix trees, a representation that is highly compact and can be easily held in memory, while supporting queries needed in computing language model probabilities on-the-fly. We present several optimisations which improve query runtimes up to 2500x, despite only incurring a modest increase in construction time and memory usage. For large corpora and high Markov orders, our method is highly competitive with the state-of-the-art KenLM package. It imposes much lower memory requirements, often by orders of magnitude, and has runtimes that are either similar (for training) or comparable (for querying).
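To make the idea of computing probabilities on the fly from a suffix index concrete, here is a minimal sketch. It is not the authors' method: it assumes a plain, uncompressed suffix array over a toy corpus instead of a compressed suffix tree, and it returns an unsmoothed maximum-likelihood estimate. The helper names (build_suffix_array, count, mle_prob) are hypothetical and chosen only for illustration.

```python
def build_suffix_array(tokens):
    # Naive construction by sorting all suffixes; fine for a toy corpus,
    # far too slow and memory-hungry for the corpora discussed in the paper.
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def _bound(tokens, sa, pattern, upper):
    # Binary search over the suffix array, comparing only the first
    # len(pattern) tokens of each suffix against the pattern.
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = tokens[sa[mid]:sa[mid] + m]
        if prefix < pattern or (upper and prefix == pattern):
            lo = mid + 1
        else:
            hi = mid
    return lo


def count(tokens, sa, pattern):
    # Occurrences of `pattern` = width of its range in the suffix array.
    return _bound(tokens, sa, pattern, True) - _bound(tokens, sa, pattern, False)


def mle_prob(tokens, sa, context, word):
    # Unsmoothed estimate count(context + word) / count(context).
    # Simplification: a context occurrence at the very end of the corpus
    # is still counted in the denominator.
    denom = count(tokens, sa, context)
    if denom == 0:
        return 0.0
    return count(tokens, sa, context + [word]) / denom


tokens = "a b a b a c".split()
sa = build_suffix_array(tokens)
p = mle_prob(tokens, sa, ["a"], "b")  # "a b" occurs twice, "a" three times
```

No n-gram table is stored here: each count is recovered at query time from a range lookup in the index, so the Markov order does not have to be fixed in advance. A practical model would also apply smoothing, which this sketch omits.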
