Using Fisher's Exact Test to Evaluate Association Measures for N-grams

Yves Bestgen

arXiv:2104.14209 · cs.CL · April 30, 2021

Open Access

TL;DR

This paper uses an extension of Fisher's exact test to evaluate lexical association measures for n-grams on a four-million-word corpus. Simple-ll is extremely effective, MI3 comes close to it for 3-grams, and some measures perform better on 3-grams than on 2-grams while others stagnate.

Contribution

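It applies an extension of Fisher's exact test to sequences longer than two words and compares often-used lexical association measures on a four-million-word corpus, using precision-recall curves and a new index called chance-corrected average precision.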

Findings

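01. Simple-ll is extremely effective, as expected.

02. MI3 is more efficient than the other hypothesis-test-based measures and nearly matches simple-ll for 3-grams.

03. Some measures are more efficient for 3-grams than for 2-grams, while others stagnate.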

Abstract

To determine whether some often-used lexical association measures assign high scores to n-grams that chance could have produced as frequently as observed, we used an extension of Fisher's exact test to sequences longer than two words to analyse a corpus of four million words. The results, based on the precision-recall curve and a new index called chance-corrected average precision, show that, as expected, simple-ll is extremely effective. They also show, however, that MI3 is more efficient than the other hypothesis tests-based measures and even reaches a performance level almost equal to simple-ll for 3-grams. It is additionally observed that some measures are more efficient for 3-grams than for 2-grams, while others stagnate.
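To make the quantities named in the abstract concrete, here is a minimal Python sketch for the ordinary 2-gram case only. It does not reproduce the paper's extension to longer sequences. It assumes the commonly used definitions of the measures (simple-ll as 2(O ln(O/E) - (O - E)) and MI3 as log2(O^3/E), with expected frequency E = f1·f2/N) and a one-sided Fisher's exact test computed from the hypergeometric tail. The function names and the toy counts are illustrative assumptions, not taken from the paper.

```python
import math


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k), via lgamma to avoid huge integers."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def fisher_p_greater(o11, f1, f2, n):
    """One-sided Fisher's exact p-value for a bigram (assumed 2-gram formulation).

    o11: observed bigram frequency; f1, f2: frequencies of the first and
    second word; n: number of bigram positions in the corpus.
    Returns P(X >= o11) for X hypergeometric with these margins.
    """
    log_denom = log_comb(n, f2)
    tail = sum(
        math.exp(log_comb(f1, k) + log_comb(n - f1, f2 - k) - log_denom)
        for k in range(o11, min(f1, f2) + 1)
    )
    return min(1.0, tail)  # guard against rounding slightly above 1


def expected_freq(f1, f2, n):
    """Expected bigram frequency under independence."""
    return f1 * f2 / n


def simple_ll(o, e):
    """Simple log-likelihood: 2 * (O * ln(O/E) - (O - E)); the log term vanishes when O = 0."""
    if o == 0:
        return 2.0 * e
    return 2.0 * (o * math.log(o / e) - (o - e))


def mi3(o, e):
    """Cubed mutual information: log2(O^3 / E). Requires O > 0."""
    return math.log2(o ** 3 / e)


# Toy counts (hypothetical, for illustration only).
o11, f1, f2, n = 4, 5, 5, 20
e = expected_freq(f1, f2, n)
print("Fisher p:", fisher_p_greater(o11, f1, f2, n))
print("simple-ll:", simple_ll(o11, e))
print("MI3:", mi3(o11, e))
```

In an evaluation like the one the abstract describes, the exact test would presumably serve as the reference for deciding which n-grams chance could have produced as frequently as observed, and each measure's ranking would then be scored against that reference.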

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Data Mining Algorithms and Applications