Modeling the Unigram Distribution

Irene Nikkarinen; Tiago Pimentel; Damián E. Blasi; Ryan Cotterell

arXiv:2106.02289 · cs.CL · June 7, 2021

TL;DR

This paper argues that accurately modeling the unigram distribution should be a central task in natural language processing, and proposes a neural model that yields better estimates than simple frequency-based ones across 7 languages.

Contribution

The paper introduces a neural model for estimating the unigram distribution (a neuralization of Goldwater et al.'s (2011) model) that addresses the biases of sample-frequency estimates.

Findings

01 The neural model provides better unigram estimates than frequency-based methods.

02 The approach is effective across 7 diverse languages.

03 It reduces bias in probability estimation for out-of-vocabulary words.

Abstract

The unigram distribution is the non-contextual probability of finding a specific word form in a corpus. While of central importance to the study of language, it is commonly approximated by each word's sample frequency in the corpus. This approach is highly dependent on sample size and assigns zero probability to any out-of-vocabulary (OOV) word form. As a result, it produces negatively biased probabilities for OOV word forms and positively biased probabilities for in-corpus words. In this work, we argue in favor of properly modeling the unigram distribution, claiming it should be a central task in natural language processing. With this in mind, we present a novel model for estimating it in a language (a neuralization of Goldwater et al.'s (2011) model) and show it produces much better estimates across a diverse set of 7 languages than the naïve use of neural character-level…
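
The bias of the sample-frequency estimate described in the abstract can be seen in a small sketch. This is not the paper's model: the "true" distribution `true_p`, the toy sample, and the helper `sample_frequency` are all hypothetical, chosen only to show that an unseen word form receives zero probability while the seen forms jointly absorb the full probability mass.

```python
from collections import Counter

def sample_frequency(tokens):
    """Estimate unigram probabilities as relative frequencies in the sample."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Hypothetical "true" unigram distribution (an assumption for illustration).
true_p = {"the": 0.5, "cat": 0.2, "dog": 0.2, "axolotl": 0.1}

# Hypothetical sample of 10 tokens in which "axolotl" happens not to occur.
sample = ["the", "cat", "the", "dog", "the", "the", "cat", "dog", "the", "cat"]

est = sample_frequency(sample)

# The out-of-vocabulary form gets zero probability (negative bias).
est_oov = est.get("axolotl", 0.0)

# The seen forms receive all of the mass under the estimate, although their
# true total mass is smaller (positive bias in aggregate).
seen_mass_est = sum(est.values())
seen_mass_true = sum(true_p[word] for word in est)

print(f"estimated P(axolotl) = {est_oov}")
print(f"estimated mass of seen forms = {seen_mass_est:.2f}")
print(f"true mass of seen forms      = {seen_mass_true:.2f}")
```

With a larger sample the missing form would eventually appear, which is the sense in which the estimate depends on sample size.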

Code & Models

Repositories

irenenikk/modelling-unigram
PyTorch · Official
