SubTokenTest: A Practical Benchmark for Real-World Sub-token Understanding
Shuyang Hou, Yi Hu, Muhan Zhang

TL;DR
SubTokenTest is a new benchmark designed to evaluate large language models' ability to understand sub-token information in practical tasks, addressing a key weakness in tokenization that affects real-world applications.
Contribution
The paper introduces SubTokenTest, a benchmark of ten tasks across four domains that specifically targets sub-token understanding, and uses it to analyze model performance and encoding strategies.
Findings
LLMs struggle significantly with sub-token understanding.
Test-time scaling improves sub-token reasoning performance.
Character-level information is variably encoded in hidden states.
Abstract
Recent advancements in large language models (LLMs) have significantly enhanced their reasoning capabilities. However, they continue to struggle with basic character-level tasks, such as counting letters in words, a problem rooted in their tokenization process. While existing benchmarks have highlighted this weakness through basic character operations, such failures are often dismissed as lacking practical relevance. Yet many real-world applications, such as navigating text-based maps or interpreting structured tables, rely heavily on precise sub-token understanding. To this end, we introduce SubTokenTest, a comprehensive benchmark that assesses sub-token understanding through practical, utility-driven tasks. Our benchmark includes ten tasks across four domains and isolates tokenization-related failures by decoupling performance from complex reasoning. We provide a comprehensive…
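As a rough illustration of the letter-counting failure mode the abstract mentions, the following Python sketch contrasts a character-level ground truth with a toy subword split. It is not taken from the paper or its benchmark: the function names (`count_letter`, `toy_tokenize`, `exact_match`), the two-entry vocabulary, and the greedy longest-match rule are all assumptions made for illustration, and real LLM tokenizers differ in detail.

```python
def count_letter(word: str, letter: str) -> int:
    """Character-level ground truth: case-insensitive occurrences of letter."""
    return word.lower().count(letter.lower())


def toy_tokenize(word: str, vocab: set) -> list:
    """Greedy longest-match split (hypothetical stand-in for a subword tokenizer).

    Falls back to a single character when no vocabulary piece matches.
    """
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest candidate piece first, shrinking toward one character.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens


def exact_match(prediction: str, word: str, letter: str) -> bool:
    """Score a model's free-text answer against the ground-truth count."""
    try:
        return int(prediction.strip()) == count_letter(word, letter)
    except ValueError:
        # Non-numeric answers are scored as incorrect.
        return False


if __name__ == "__main__":
    toy_vocab = {"straw", "berry"}  # assumed vocabulary, for illustration only
    print(toy_tokenize("strawberry", toy_vocab))
    print(count_letter("strawberry", "r"))
    print(exact_match("3", "strawberry", "r"))
```

In this toy setting the word reaches the model as two opaque pieces rather than ten characters, so the count of a given letter has to be recovered from whatever the model has learned about each piece's spelling.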
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
