Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index
Hao Xu, Jiacheng Liu, Yejin Choi, Noah A. Smith, Hannaneh Hajishirzi

TL;DR
Infini-gram mini is a scalable, efficient system that uses FM-indexes to enable exact n-gram search over petabyte-scale Internet text corpora, cutting storage overhead and speeding up indexing.
Contribution
The paper introduces an optimized FM-index-based system that can index and search petabyte-scale text efficiently, with substantial improvements in indexing speed and memory use over the best existing FM-index implementation.
Findings
Indexed 83TB of Internet text in 99 days with a single CPU node.
Discovered high contamination levels in major language model benchmarks.
Provides a web interface and API for large-scale text search.
Abstract
Language models are trained mainly on massive text data from the Internet, and it becomes increasingly important to understand this data source. Exact-match search engines enable searching in large text corpora - counting string appearances and retrieving the enclosing documents - yet the high storage overhead hinders their application on Internet-scale data. We present infini-gram mini, an efficient and scalable system that can make petabyte-level text corpora searchable. Based on the FM-index data structure (Ferragina and Manzini, 2000), which simultaneously indexes and compresses text, our system creates indexes with size only 44% of the corpus. Infini-gram mini greatly improves upon the best existing implementation of FM-index in terms of indexing speed (18×) and memory use during both indexing (3.2× reduction) and querying (down to a negligible amount). We index 83TB…
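To illustrate how an FM-index can count string appearances, the following is a minimal Python sketch of textbook backward search. It is not the authors' implementation: the function names (`build_fm_index`, `count`) are hypothetical, the suffix array is built naively, and the occurrence table is stored uncompressed, whereas a practical FM-index samples and compresses these structures to reach a small index size.

```python
def build_fm_index(text, sentinel="\0"):
    """Build a toy FM-index (assumption: the sentinel does not occur in text)."""
    s = text + sentinel
    # Naive suffix array: sort suffix start positions lexicographically.
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(s[i - 1] for i in sa)
    alphabet = sorted(set(s))
    # c_table[ch] = number of characters in s strictly smaller than ch.
    c_table, total = {}, 0
    for ch in alphabet:
        c_table[ch] = total
        total += s.count(ch)
    # occ[ch][i] = number of occurrences of ch in bwt[:i] (uncompressed here).
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, b in enumerate(bwt):
        for ch in alphabet:
            occ[ch][i + 1] = occ[ch][i] + (1 if b == ch else 0)
    return c_table, occ, len(s)


def count(pattern, c_table, occ, n):
    """Count occurrences of pattern via backward search."""
    lo, hi = 0, n  # half-open range of suffix-array rows matching the suffix so far
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


c_table, occ, n = build_fm_index("abracadabra")
print(count("abra", c_table, occ, n))  # 2
print(count("a", c_table, occ, n))     # 5
```

Each query step touches only the two tables, never the original text, which is why the index can stand in for the corpus. The number of steps is proportional to the pattern length rather than the corpus size.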
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
