On the Proper Treatment of Tokenization in Psycholinguistics
Mario Giulianelli, Luca Malagutti, Juan Luis Gastaldi, Brian DuSell, Tim Vieira, Ryan Cotterell

TL;DR
This paper addresses the misalignment issue caused by tokenization in language models used for psycholinguistic research and proposes marginalizing token-level models into character-level models to improve the accuracy of surprisal measurements.
Contribution
It introduces a method to marginalize token-level language models into character-level models, solving tokenization misalignment issues in psycholinguistic surprisal analysis.
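The idea of marginalization can be illustrated with a minimal sketch. It is not the paper's algorithm: the toy distribution `TOKEN_LM`, the helper names, and the exact enumeration are all assumptions made for illustration, whereas a real token-level model would require an approximate procedure. Several token sequences can spell the same character string, so the character-level probability of a string is the sum over all of them, and the surprisal of a region is then computed from character-level prefix probabilities.

```python
import math
from collections import defaultdict

# Hypothetical toy token-level model: an explicit distribution over
# complete token sequences (assumed values, for illustration only).
TOKEN_LM = {
    ("the", " cat"): 0.4,
    ("the", " c", "at"): 0.1,   # a second tokenization of "the cat"
    ("the", " dog"): 0.3,
    ("a", " cat"): 0.2,
}

def marginalize(token_lm):
    """Sum the probability of every token sequence that spells the same string."""
    char_lm = defaultdict(float)
    for tokens, prob in token_lm.items():
        char_lm["".join(tokens)] += prob
    return dict(char_lm)

def prefix_prob(char_lm, prefix):
    """Probability that a character string starts with `prefix`."""
    return sum(p for s, p in char_lm.items() if s.startswith(prefix))

def region_surprisal(char_lm, context, region):
    """Surprisal (in bits) of `region` given the preceding characters `context`."""
    joint = prefix_prob(char_lm, context + region)
    return -math.log2(joint / prefix_prob(char_lm, context))

CHAR_LM = marginalize(TOKEN_LM)
```

Under these assumed numbers, `region_surprisal(CHAR_LM, "the", " cat")` uses the total mass of both tokenizations of "the cat", which a single canonical tokenization would miss.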
Findings
Marginalized models provide better psychometric predictions.
Surprisal computed over certain focal areas is a better psychometric predictor than surprisal of the full region of interest.
The approach is independent of the tokenization scheme.
Abstract
Language models are widely used in computational psycholinguistics to test theories that relate the negative log probability (the surprisal) of a region of interest (a substring of characters) under a language model to its cognitive cost experienced by readers, as operationalized, for example, by gaze duration on the region. However, the application of modern language models to psycholinguistic studies is complicated by the practice of using tokenization as an intermediate step in training a model. Doing so results in a language model over token strings rather than one over character strings. Vexingly, regions of interest are generally misaligned with these token strings. The paper argues that token-level language models should be (approximately) marginalized into character-level language models before they are used in psycholinguistic studies to compute the surprisal of a region of…
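The misalignment described in the abstract can be made concrete with a small sketch. The token split and the helper names below are hypothetical and not taken from the paper or from any particular tokenizer; the point is only that a region of interest defined over characters need not begin or end at a token boundary.

```python
def token_spans(tokens):
    """Return the (start, end) character offsets of each token."""
    spans, start = [], 0
    for tok in tokens:
        spans.append((start, start + len(tok)))
        start += len(tok)
    return spans

def is_aligned(tokens, region):
    """True if both edges of the character region fall on token boundaries."""
    r_start, r_end = region
    boundaries = {0} | {end for _, end in token_spans(tokens)}
    return r_start in boundaries and r_end in boundaries

# Assumed tokenization in which the leading space is attached to the token.
TOKENS = ["The", " unbeliev", "able", " story"]
TEXT = "".join(TOKENS)

WORD_REGION = (4, 16)        # the word "unbelievable" without its leading space
WORD_WITH_SPACE = (3, 16)    # the same word including the leading space
```

Here `is_aligned(TOKENS, WORD_REGION)` is false because the region starts one character inside the token " unbeliev", while `is_aligned(TOKENS, WORD_WITH_SPACE)` is true. A character-level model avoids the question entirely, since any character span is a valid region.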
Taxonomy
Topics: Linguistic research and analysis
