Assessing Keyness using Permutation Tests

Thoralf Mildenberger

arXiv:2308.13383·cs.CL·August 28, 2023

TL;DR

This paper introduces a permutation-based method for assessing keyness in corpus linguistics that accounts for document-level distribution of words, providing more accurate significance testing than traditional token-based models.

Contribution

It proposes a novel resampling approach that models corpora as document samples and applies permutation tests, improving keyness significance assessment across various scores.

Findings

1. More accurate p-values for keyness scores such as LLR

2. Applicable to any keyness score, including effect size measures

3. Implementation available in the R package `keyperm`

Abstract

We propose a resampling-based approach for assessing keyness in corpus linguistics based on suggestions by Gries (2006, 2022). Traditional approaches based on hypothesis tests (e.g. Likelihood Ratio) model the corpora as independent identically distributed samples of tokens. This model does not account for the often observed uneven distribution of occurrences of a word across a corpus. When occurrences of a word are concentrated in few documents, large values of LLR and similar scores are in fact much more likely than accounted for by the token-by-token sampling model, leading to false positives. We replace the token-by-token sampling model by a model where corpora are samples of documents rather than tokens, which is much closer to the way corpora are actually assembled. We then use a permutation approach to approximate the distribution of a given keyness score under the null hypothesis…
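The document-level permutation idea described in the abstract can be illustrated with a minimal Python sketch. This is not the `keyperm` API and not necessarily the paper's exact procedure: the function names `llr` and `permutation_p_value` are hypothetical, the score is the two-term log-likelihood form often used in keyness analysis, documents are assumed to be non-empty lists of tokens, and the p-value uses the common (hits + 1) / (permutations + 1) estimate.

```python
import math
import random


def llr(a, b, n_a, n_b):
    """Two-term log-likelihood ratio for a word occurring a times in
    corpus A (n_a tokens) and b times in corpus B (n_b tokens)."""
    total = a + b
    if total == 0:
        return 0.0
    e_a = n_a * total / (n_a + n_b)  # expected count in A under the null
    e_b = n_b * total / (n_a + n_b)  # expected count in B under the null
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e_a)
    if b > 0:
        g2 += b * math.log(b / e_b)
    return 2.0 * g2


def permutation_p_value(docs_a, docs_b, word, score=llr, n_perm=999, seed=0):
    """Shuffle whole documents (not tokens) between the two corpora and
    compare the observed score with the scores of the relabelled corpora."""
    rng = random.Random(seed)
    docs = list(docs_a) + list(docs_b)
    counts = [doc.count(word) for doc in docs]   # per-document word frequency
    lengths = [len(doc) for doc in docs]         # per-document token count
    total_count, total_len = sum(counts), sum(lengths)
    k = len(docs_a)                              # corpus A keeps its size

    def score_for(indices_a):
        a = sum(counts[i] for i in indices_a)
        n_a = sum(lengths[i] for i in indices_a)
        return score(a, total_count - a, n_a, total_len - n_a)

    indices = list(range(len(docs)))
    observed = score_for(indices[:k])
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(indices)
        # small tolerance guards against floating-point noise in ties
        if score_for(indices[:k]) >= observed - 1e-12:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Toy data: all 50 occurrences of "w" sit in a single document of corpus A.
burst = ["w"] * 50 + ["x"] * 50
plain = ["x"] * 100
docs_a = [burst] + [plain] * 4
docs_b = [plain] * 5
observed, p = permutation_p_value(docs_a, docs_b, "w")
```

In this toy case the observed score is about 69.3, which a token-level model would treat as highly significant. Under document-level relabelling, however, the single bursty document simply lands in one corpus or the other and the score is the same either way, so the permutation p-value is 1. Because `score` is an argument, any other keyness measure with the same signature (for example an effect size) could be substituted.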

Code & Models

Repositories

thmild/keyperm (official)
