Computing n-Gram Statistics in MapReduce

Klaus Berberich; Srikanta Bedathur

arXiv:1207.4371 · cs.IR · July 19, 2012

TL;DR

This paper explores efficient methods for computing n-gram statistics using MapReduce, including novel algorithms and practical implementation insights, demonstrated through extensive experiments on large text corpora.

Contribution

The paper introduces Suffix-σ, a new suffix-based algorithm for n-gram counting in MapReduce, and analyzes its efficiency compared to existing methods.

Findings

1. The Suffix-σ method outperforms traditional approaches in certain scenarios.

2. Extensions for maximality and closedness can be integrated into the framework.

3. Experimental results highlight trade-offs among different algorithms.

Abstract

Statistics about n-grams (i.e., sequences of contiguous words or other tokens in text documents or other string data) are an important building block in information retrieval and natural language processing. In this work, we study how n-gram statistics, optionally restricted by a maximum n-gram length and minimum collection frequency, can be computed efficiently harnessing MapReduce for distributed data processing. We describe different algorithms, ranging from an extension of word counting, via methods based on the Apriori principle, to a novel method Suffix-σ that relies on sorting and aggregating suffixes. We examine possible extensions of our method to support the notions of maximality/closedness and to perform aggregations beyond occurrence counting. Assuming Hadoop as a concrete MapReduce implementation, we provide insights on an efficient implementation of the methods.…
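
To make the two ends of that range concrete, here is a minimal single-process sketch in Python. It is an illustration under stated assumptions, not the paper's Hadoop implementation: the function names, the parameter names `sigma` (maximum n-gram length) and `tau` (minimum collection frequency), and the in-memory "map" and "reduce" steps are all hypothetical. The first function extends word counting by emitting every n-gram; the second emits only one length-truncated suffix per position and recovers n-gram counts by aggregating over suffix prefixes.

```python
from collections import Counter


def ngrams_naive(docs, sigma, tau):
    """Word-count style: emit every n-gram of length <= sigma, then sum."""
    counts = Counter()
    for doc in docs:                              # stands in for the map phase
        for i in range(len(doc)):
            for n in range(1, sigma + 1):
                if i + n <= len(doc):
                    counts[tuple(doc[i:i + n])] += 1
    # stands in for the reduce phase: keep n-grams occurring at least tau times
    return {g: c for g, c in counts.items() if c >= tau}


def ngrams_suffix(docs, sigma, tau):
    """Suffix style: emit one suffix per position, truncated to sigma tokens."""
    suffixes = Counter()
    for doc in docs:                              # one record per position
        for i in range(len(doc)):
            suffixes[tuple(doc[i:i + sigma])] += 1
    counts = Counter()
    # Sorting only mirrors the sort step of a MapReduce shuffle; this
    # in-memory aggregation does not depend on the order.
    for suffix, c in sorted(suffixes.items()):
        # every prefix of a suffix is an n-gram starting at that position
        for n in range(1, len(suffix) + 1):
            counts[suffix[:n]] += c
    return {g: c for g, c in counts.items() if c >= tau}


if __name__ == "__main__":
    docs = [["a", "x", "b"], ["b", "a", "x", "b"]]
    print(ngrams_naive(docs, sigma=3, tau=2) == ngrams_suffix(docs, sigma=3, tau=2))
```

Both functions return the same dictionary; the difference is in how many intermediate records are produced. The naive variant emits up to `sigma` records per token position, while the suffix variant emits exactly one, which is the kind of trade-off the abstract alludes to.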

Taxonomy

Topics: Data Mining Algorithms and Applications · Natural Language Processing Techniques · Algorithms and Data Compression