# Comparison of Modified Kneser-Ney and Witten-Bell Smoothing Techniques in Statistical Language Model of Bahasa Indonesia

**Authors:** Ismail Rusli

arXiv: 1706.07786 · 2017-06-26

## TL;DR

This study compares Modified Kneser-Ney and Witten-Bell smoothing in statistical language models for Bahasa Indonesia, trained on 22M words extracted from Indonesian Wikipedia. Modified Kneser-Ney consistently yields lower perplexity across the 3-gram, 5-gram, and 7-gram models tested.

## Contribution

The paper compares the two smoothing techniques on what the author believes to be the largest training set (22M words) used so far to build a statistical language model for Bahasa Indonesia, and shows that Modified Kneser-Ney smoothing performs better in this language.

## Key findings

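- Modified Kneser-Ney consistently achieves lower perplexity than Witten-Bell for 3-gram, 5-gram, and 7-gram models.
- With Modified Kneser-Ney, the 5-gram model outperforms the 7-gram model.
- Witten-Bell perplexity improves consistently as the n-gram order increases.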
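
For reference, perplexity is the metric behind all three findings. The paper's own notation is not reproduced in this summary, so the following is the standard textbook definition for an n-gram model evaluated on a test sequence $w_1, \dots, w_N$ (the symbols $N$ and $n$ are chosen here for illustration):

$$
\mathrm{PP}(w_1,\dots,w_N) = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \ln P\big(w_i \mid w_{i-n+1},\dots,w_{i-1}\big)\right)
$$

Lower perplexity means the model assigns higher probability to held-out text, which is the sense in which one smoothing technique "outperforms" another in the findings above.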

## Abstract

Smoothing is one technique for overcoming data sparsity in statistical language models. Although its mathematical definition has no explicit dependency on any specific natural language, the differing natures of natural languages result in different effects of smoothing techniques. This is true for the Russian language, as shown by Whittaker (1998). In this paper, we compared Modified Kneser-Ney and Witten-Bell smoothing techniques in a statistical language model of Bahasa Indonesia. We used training sets totaling 22M words that we extracted from the Indonesian version of Wikipedia. As far as we know, this is the largest training set used to build a statistical language model for Bahasa Indonesia. The experiments with 3-gram, 5-gram, and 7-gram models showed that Modified Kneser-Ney consistently outperforms the Witten-Bell smoothing technique in terms of perplexity. Interestingly, the 5-gram model with Modified Kneser-Ney smoothing outperforms the 7-gram model. Meanwhile, Witten-Bell smoothing improves consistently as the n-gram order increases.
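
To make the abstract's terms concrete, here is a minimal sketch of interpolated Witten-Bell smoothing for a bigram model, together with a perplexity function. It is not the setup used in the paper: the function names (`train_witten_bell_bigram`, `perplexity`), the `vocab_size` parameter, the sentence markers `<s>` and `</s>`, and the toy Indonesian sentences are all assumptions made for illustration, and the paper's experiments use 3-gram to 7-gram models on a far larger corpus.

```python
import math
from collections import Counter, defaultdict


def train_witten_bell_bigram(sentences, vocab_size):
    """Return p(word | prev) using interpolated Witten-Bell smoothing.

    vocab_size is an assumed closed-vocabulary size (including "</s>");
    it must be at least the number of word types seen in training.
    """
    unigrams = Counter()
    bigrams = defaultdict(Counter)
    for sent in sentences:
        tokens = ["<s>"] + list(sent) + ["</s>"]
        unigrams.update(tokens[1:])  # "<s>" is never predicted
        for prev, word in zip(tokens, tokens[1:]):
            bigrams[prev][word] += 1

    total = sum(unigrams.values())  # number of predicted tokens
    types = len(unigrams)           # number of distinct predicted words

    def p_unigram(word):
        # Interpolate unigram counts with a uniform distribution.
        return (unigrams[word] + types / vocab_size) / (total + types)

    def p_bigram(word, prev):
        followers = bigrams.get(prev)
        if not followers:
            # Unseen history: fall back entirely to the unigram estimate.
            return p_unigram(word)
        c_h = sum(followers.values())  # count of the history
        t_h = len(followers)           # distinct words seen after it
        return (followers[word] + t_h * p_unigram(word)) / (c_h + t_h)

    return p_bigram


def perplexity(model, sentences):
    """Perplexity of a bigram model over tokenized test sentences."""
    log_sum, n = 0.0, 0
    for sent in sentences:
        tokens = ["<s>"] + list(sent) + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            log_sum += math.log(model(word, prev))
            n += 1
    return math.exp(-log_sum / n)


# Toy usage with made-up sentences (not data from the paper).
train = [["saya", "makan", "nasi"], ["saya", "minum", "air"]]
vocab = ["saya", "makan", "nasi", "minum", "air", "</s>", "kopi"]
model = train_witten_bell_bigram(train, vocab_size=len(vocab))
print(perplexity(model, [["saya", "minum", "nasi"]]))
```

The weight given to the lower-order estimate grows with the number of distinct words observed after a history, which is the core idea of Witten-Bell smoothing. Modified Kneser-Ney instead subtracts count-dependent discounts and uses continuation counts for the lower-order distribution. It is not sketched here.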

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1706.07786/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/1706.07786/full.md

## References

8 references — full list in the complete paper: https://tomesphere.com/paper/1706.07786/full.md

---
Source: https://tomesphere.com/paper/1706.07786