On the Proper Treatment of Tokenization in Psycholinguistics

Mario Giulianelli; Luca Malagutti; Juan Luis Gastaldi; Brian DuSell; Tim Vieira; Ryan Cotterell

arXiv:2410.02691 · cs.CL · December 9, 2024

TL;DR

This paper addresses the misalignment issue caused by tokenization in language models used for psycholinguistic research and proposes marginalizing token-level models into character-level models to improve the accuracy of surprisal measurements.

Contribution

It introduces a method to marginalize token-level language models into character-level models, solving tokenization misalignment issues in psycholinguistic surprisal analysis.
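
As a rough illustration of what marginalizing a token-level model into a character-level one means, the sketch below sums the probability of every token sequence that spells the same character string. The toy distribution, the vocabulary, and the function names are assumptions made for this example; they are not taken from the paper or its repository, and a real language model would require an approximation rather than exact enumeration.

```python
import math
from collections import defaultdict

# Hypothetical toy token-level model: probabilities of complete token
# sequences over the made-up vocabulary {"a", "b", "ab"}.
token_string_probs = {
    ("a", "b"): 0.25,  # spells "ab"
    ("ab",): 0.25,     # also spells "ab"
    ("b", "a"): 0.5,   # spells "ba"
}


def marginalize(token_probs):
    """Sum the probability of all token sequences that decode to the same characters."""
    char_probs = defaultdict(float)
    for tokens, p in token_probs.items():
        char_probs["".join(tokens)] += p
    return dict(char_probs)


def surprisal(p):
    """Negative log probability, in bits."""
    return -math.log2(p)


char_probs = marginalize(token_string_probs)
print(char_probs)
print(surprisal(char_probs["ab"]))
```

In this toy case the character string "ab" receives mass from two different tokenizations, so scoring only one of them would overstate its surprisal.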

Findings

01. Marginalized models provide better psychometric predictions.

02. The surprisal of certain focal areas is a better psychometric predictor than the surprisal of the region of interest itself.

03. The approach is independent of the tokenization scheme.

Abstract

Language models are widely used in computational psycholinguistics to test theories that relate the negative log probability (the surprisal) of a region of interest (a substring of characters) under a language model to its cognitive cost experienced by readers, as operationalized, for example, by gaze duration on the region. However, the application of modern language models to psycholinguistic studies is complicated by the practice of using tokenization as an intermediate step in training a model. Doing so results in a language model over token strings rather than one over character strings. Vexingly, regions of interest are generally misaligned with these token strings. The paper argues that token-level language models should be (approximately) marginalized into character-level language models before they are used in psycholinguistic studies to compute the surprisal of a region of…
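
To make the notion of the surprisal of a region of interest concrete, here is a minimal sketch that assumes a character-level distribution is already available. The toy strings, their probabilities, and the helper names `prefix_prob` and `region_surprisal` are hypothetical and chosen only for illustration; they do not reproduce the paper's implementation.

```python
import math

# Hypothetical character-level distribution over complete strings (toy values).
char_string_probs = {
    "the cat": 0.5,
    "the cab": 0.25,
    "a cat": 0.25,
}


def prefix_prob(prefix, dist):
    """Probability that a string drawn from dist starts with the given characters."""
    return sum(p for s, p in dist.items() if s.startswith(prefix))


def region_surprisal(context, region, dist):
    """Surprisal (bits) of the characters in `region` given the preceding `context`."""
    joint = prefix_prob(context + region, dist)
    return -math.log2(joint / prefix_prob(context, dist))


s = region_surprisal("the ", "cat", char_string_probs)
print(s)
```

Because the region is defined over characters, its boundaries can fall anywhere in the string, regardless of how a tokenizer would have segmented it.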

Code & Models

Repositories

rycolab/psycho-toke (official)

Videos

On the Proper Treatment of Tokenization in Psycholinguistics (Underline)

Taxonomy

Topics: Linguistic research and analysis