Words, Subwords, and Morphemes: What Really Matters in the Surprisal-Reading Time Relationship?

Sathvik Nair; Philip Resnik

arXiv:2310.17774 · cs.CL · October 30, 2023 · 1 citation

Open Access

TL;DR

This study tests whether the choice of tokenization (orthographic words, BPE subwords, or morphological segmentation) affects how well surprisal estimates predict reading times. In the aggregate, BPE-based predictions do not suffer relative to the alternatives, but a finer-grained analysis points to potential issues with relying on BPE.

Contribution

It compares surprisal estimates computed under orthographic, morphological, and BPE tokenization against reading time data, highlighting the strengths and limitations of each approach.

Findings

01 In the aggregate, surprisal predictions based on BPE tokenization do not suffer relative to morphological and orthographic segmentation.

02 A finer-grained analysis points to potential issues with relying on BPE-based surprisal estimates.

03 Morphologically-aware surprisal estimates show promising results, and the analysis suggests a new method for evaluating morphological prediction.

Abstract

An important assumption that comes with using LLMs on psycholinguistic data has gone unverified. LLM-based predictions are based on subword tokenization, not decomposition of words into morphemes. Does that matter? We carefully test this by comparing surprisal estimates using orthographic, morphological, and BPE tokenization against reading time data. Our results replicate previous findings and provide evidence that in the aggregate, predictions using BPE tokenization do not suffer relative to morphological and orthographic segmentation. However, a finer-grained analysis points to potential issues with relying on BPE-based tokenization, provides promising results involving morphologically-aware surprisal estimates, and suggests a new method for evaluating morphological prediction.
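
For readers unfamiliar with the quantity being compared, the following is a minimal sketch of how a word-level surprisal can be assembled from segment-level probabilities, whatever the segmentation (a whole word, BPE pieces, or morphemes). It relies only on the standard definition of surprisal as the negative log probability and on the chain rule; it is not the authors' pipeline. The function names and the probability values are hypothetical, chosen for illustration rather than taken from any model or from the paper.

```python
import math


def segment_surprisals(cond_probs):
    """Surprisal in bits, -log2 p, for each segment's conditional probability."""
    return [-math.log2(p) for p in cond_probs]


def word_surprisal(cond_probs):
    """Word-level surprisal as the sum of its segments' surprisals.

    By the chain rule, the probability of a word is the product of the
    conditional probabilities of its segments, so the surprisals add.
    """
    return sum(segment_surprisals(cond_probs))


# Hypothetical conditional probabilities for one word under two segmentations.
# These numbers are invented for illustration only.
bpe_piece_probs = [0.5, 0.25]   # word split into two subword pieces
whole_word_probs = [0.125]      # same word treated as a single unit

print(word_surprisal(bpe_piece_probs))
print(word_surprisal(whole_word_probs))
```

In this toy case the two segmentations assign the word the same total probability, so the word-level surprisals match. With real models, different tokenizations generally yield different estimates, which is the kind of difference the paper evaluates against reading times.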

Taxonomy

Topics: Natural Language Processing Techniques · Text Readability and Simplification · Authorship Attribution and Profiling

Methods: Byte Pair Encoding