Words, Subwords, and Morphemes: What Really Matters in the Surprisal-Reading Time Relationship?
Sathvik Nair, Philip Resnik

TL;DR
This study tests whether the choice of tokenization (orthographic words, BPE subwords, or morphological segmentation) affects how well surprisal estimates predict reading times. In the aggregate, BPE performs comparably to the alternatives, but a finer-grained analysis reveals potential issues with relying on it.
Contribution
The paper compares orthographic, morphological, and BPE tokenization for surprisal estimation against reading time data, highlighting strengths and limitations of each approach.
Findings
In the aggregate, surprisal predictions based on BPE tokenization do not suffer relative to those based on morphological or orthographic segmentation.
A finer-grained analysis reveals potential issues with relying on BPE-based surprisal estimates.
Morphologically-aware surprisal estimates show promising results, and the analysis suggests a new method for evaluating morphological prediction.
Abstract
An important assumption that comes with using LLMs on psycholinguistic data has gone unverified. LLM-based predictions are based on subword tokenization, not decomposition of words into morphemes. Does that matter? We carefully test this by comparing surprisal estimates using orthographic, morphological, and BPE tokenization against reading time data. Our results replicate previous findings and provide evidence that in the aggregate, predictions using BPE tokenization do not suffer relative to morphological and orthographic segmentation. However, a finer-grained analysis points to potential issues with relying on BPE-based tokenization, yields promising results involving morphologically-aware surprisal estimates, and suggests a new method for evaluating morphological prediction.
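To make the quantity being compared concrete, here is a minimal sketch of how a word-level surprisal can be obtained when a word is split into several pieces, whether BPE subwords or morphemes. It assumes the common convention of summing the surprisals of the pieces, which follows from the chain rule of probability; the summary above does not confirm that this is the authors' exact procedure. The function names and the probabilities are hypothetical and chosen only for illustration.

```python
import math


def token_surprisal(prob):
    """Surprisal in bits of one token, given its conditional probability."""
    return -math.log2(prob)


def word_surprisal(token_probs):
    """Word-level surprisal as the sum of its pieces' surprisals.

    By the chain rule, the probability of a word is the product of the
    conditional probabilities of its pieces, so the surprisals add up.
    """
    return sum(token_surprisal(p) for p in token_probs)


# Hypothetical conditional probabilities for a word split into two pieces.
# Real values would come from a language model trained with the
# corresponding segmentation (orthographic, BPE, or morphological).
piece_probs = [0.25, 0.5]

print(word_surprisal(piece_probs))  # 3.0 bits, same as -log2(0.25 * 0.5)
```

Under this convention, different tokenizations change the estimate only through the probabilities the model assigns to the pieces, which is the comparison the abstract describes.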
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification · Authorship Attribution and Profiling
Methods: Byte Pair Encoding
