Computing n-Gram Statistics in MapReduce
Klaus Berberich, Srikanta Bedathur

TL;DR
This paper explores efficient methods for computing n-gram statistics using MapReduce, including novel algorithms and practical implementation insights, demonstrated through extensive experiments on large text corpora.
Contribution
It introduces a new suffix-based algorithm for n-gram counting in MapReduce and analyzes its efficiency compared to existing methods.
Findings
The Suffix-σ method outperforms traditional approaches in certain scenarios.
Extensions for maximality and closedness can be integrated into the framework.
Experimental results highlight trade-offs among different algorithms.
Abstract
Statistics about n-grams (i.e., sequences of contiguous words or other tokens in text documents or other string data) are an important building block in information retrieval and natural language processing. In this work, we study how n-gram statistics, optionally restricted by a maximum n-gram length and minimum collection frequency, can be computed efficiently harnessing MapReduce for distributed data processing. We describe different algorithms, ranging from an extension of word counting, via methods based on the Apriori principle, to a novel method, Suffix-σ, that relies on sorting and aggregating suffixes. We examine possible extensions of our method to support the notions of maximality/closedness and to perform aggregations beyond occurrence counting. Assuming Hadoop as a concrete MapReduce implementation, we provide insights on an efficient implementation of the methods.…
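To make the abstract's contrast concrete, here is a minimal single-process sketch in Python, not the authors' Hadoop implementation. It compares the word-count extension (one record per n-gram occurrence) with a simplified suffix-based variant (one record per token position). The names `naive_ngram_counts`, `suffix_ngram_counts`, `sigma` (maximum n-gram length), and `tau` (minimum collection frequency) are assumptions made for illustration, and the in-memory `Counter` stands in for MapReduce's shuffle, sort, and reduce phases.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

NGram = Tuple[str, ...]


def naive_ngram_counts(docs: Iterable[str], sigma: int, tau: int) -> Dict[NGram, int]:
    """Word-count extension: emit every n-gram of length <= sigma, then sum."""
    counts: Counter = Counter()
    for doc in docs:
        tokens = doc.split()  # assumption: whitespace tokenization
        for i in range(len(tokens)):
            for n in range(1, sigma + 1):
                if i + n <= len(tokens):
                    # "map": one emitted record per n-gram occurrence
                    counts[tuple(tokens[i:i + n])] += 1
    # "reduce": keep n-grams meeting the minimum collection frequency
    return {g: c for g, c in counts.items() if c >= tau}


def suffix_ngram_counts(docs: Iterable[str], sigma: int, tau: int) -> Dict[NGram, int]:
    """Simplified suffix-based variant: emit one truncated suffix per position."""
    suffixes: Counter = Counter()
    for doc in docs:
        tokens = doc.split()
        for i in range(len(tokens)):
            # "map": one record per position, the suffix cut to sigma tokens
            suffixes[tuple(tokens[i:i + sigma])] += 1
    counts: Counter = Counter()
    for suffix, c in suffixes.items():
        # "reduce": every prefix of a truncated suffix is an n-gram occurrence
        for n in range(1, len(suffix) + 1):
            counts[suffix[:n]] += c
    return {g: c for g, c in counts.items() if c >= tau}


docs = ["a b a b", "a b c"]
result = suffix_ngram_counts(docs, sigma=2, tau=2)
print(result == naive_ngram_counts(docs, sigma=2, tau=2))
```

Both functions return the same counts. The difference this sketch is meant to show is the number of emitted records: the first emits up to `sigma` records per token position, the second emits exactly one.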
Taxonomy
Topics: Data Mining Algorithms and Applications · Natural Language Processing Techniques · Algorithms and Data Compression
