Confounding Factors in Relating Model Performance to Morphology
Wessel Poelman, Thomas Bauwens, Miryam de Lhoneux

TL;DR
This paper investigates how morphological features influence language model performance, identifies confounding factors in previous analyses, and proposes new metrics to better understand the relationship between morphology and modeling difficulty.
Contribution
It highlights confounding factors in prior studies, re-assesses hypotheses on morphology's impact, and introduces token bigram metrics as intrinsic predictors of modeling difficulty.
Findings
Conflicting prior evidence on how morphology relates to language modeling is attributed to confounding factors in experimental setups.
Token bigram metrics correlate with morphological complexity without expert annotation.
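Re-assessing the three hypotheses of Arnett & Bergen (2025) (morphological alignment of tokenization, tokenization efficiency, and dataset size) shows that each conclusion includes confounding factors.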
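This summary does not give the paper's exact definitions of its token bigram metrics. As a rough illustration of what an annotation-free bigram statistic over a tokenized corpus can look like, the sketch below computes two generic quantities: the conditional entropy of the next token given the previous one, and the bigram type-token ratio. Both function names and both metrics are assumptions chosen for illustration, not the authors' method.

```python
import math
from collections import Counter


def bigram_conditional_entropy(tokens):
    """Estimate H(next | previous) in bits from adjacent token pairs.

    Illustrative statistic only; not taken from the paper.
    """
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        return 0.0
    pair_counts = Counter(pairs)
    prev_counts = Counter(prev for prev, _ in pairs)
    total = len(pairs)
    entropy = 0.0
    for (prev, _), count in pair_counts.items():
        p_pair = count / total                    # joint probability of the bigram
        p_next_given_prev = count / prev_counts[prev]
        entropy -= p_pair * math.log2(p_next_given_prev)
    return entropy


def bigram_type_token_ratio(tokens):
    """Share of adjacent token pairs that are distinct (hypothetical metric)."""
    pairs = list(zip(tokens, tokens[1:]))
    return len(set(pairs)) / len(pairs) if pairs else 0.0


if __name__ == "__main__":
    # Toy token sequence standing in for the output of any subword tokenizer.
    sample = ["un", "break", "able", "un", "do", "able"]
    print(bigram_conditional_entropy(sample))
    print(bigram_type_token_ratio(sample))
```

Both functions need only a token sequence, so they can be applied to any language and tokenizer without morphological annotation, which is the property the findings above emphasize.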
Abstract
The extent to which individual language characteristics influence tokenization and language modeling is an open question. Differences in morphological systems have been suggested as both unimportant and crucial to consider (Cotterell et al., 2018; Gerz et al., 2018a; Park et al., 2021, inter alia). We argue this conflicting evidence is due to confounding factors in experimental setups, making it hard to compare results and draw conclusions. We identify confounding factors in analyses trying to answer the question of whether, and how, morphology relates to language modeling. Next, we re-assess three hypotheses by Arnett & Bergen (2025) for why modeling agglutinative languages results in higher perplexities than fusional languages: they look at morphological alignment of tokenization, tokenization efficiency, and dataset size. We show that each conclusion includes confounding factors.…
Taxonomy
Topics: Language and cultural evolution · Natural Language Processing Techniques · Topic Modeling
