TL;DR
TASE is a benchmark designed to evaluate multilingual LLMs' token-level awareness and structural reasoning, revealing current models' limitations and guiding future improvements in fine-grained language understanding.
Contribution
Introduces TASE, a multilingual benchmark for assessing token-level awareness and structural reasoning in LLMs: 10 tasks across Chinese, English, and Korean, a 35,927-instance evaluation set, and a scalable synthetic data generation pipeline for training.
Findings
Human performance exceeds LLMs on TASE tasks.
Current LLMs show weaknesses in token-level reasoning.
TASE provides insights for future model improvements.
Abstract
While large language models (LLMs) have demonstrated remarkable performance on high-level semantic tasks, they often struggle with fine-grained, token-level understanding and structural reasoning--capabilities that are essential for applications requiring precision and control. We introduce TASE, a comprehensive benchmark designed to evaluate LLMs' ability to perceive and reason about token-level information across languages. TASE covers 10 tasks under two core categories: token awareness and structural understanding, spanning Chinese, English, and Korean, with a 35,927-instance evaluation set and a scalable synthetic data generation pipeline for training. Tasks include character counting, token alignment, syntactic structure parsing, and length constraint satisfaction. We evaluate over 30 leading commercial and open-source LLMs, including O3, Claude 4, Gemini 2.5 Pro, and DeepSeek-R1,…
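To make the task types named in the abstract concrete, here is a minimal Python sketch of how a character-counting item could be generated with a programmatic gold answer, and how a length constraint could be checked. This is not the TASE pipeline: the function names, prompt wording, word list, and scoring rule are all hypothetical, chosen only for illustration.

```python
import random
import re


def make_char_count_instance(rng, words):
    """Build one hypothetical character-counting item (not the TASE format).

    The gold answer is computed programmatically, which is what makes this
    kind of task easy to generate synthetically at scale.
    """
    word = rng.choice(words)
    target = rng.choice(sorted(set(word)))  # pick a character that occurs in the word
    return {
        "prompt": f'How many times does "{target}" appear in "{word}"?',
        "answer": word.count(target),
    }


def score_exact_match(prediction, answer):
    """Assumed scoring rule: compare the first integer in the reply to the gold count."""
    match = re.search(r"\d+", prediction)
    return match is not None and int(match.group()) == answer


def satisfies_length(text, min_chars, max_chars):
    """Check a length constraint in Unicode code points (assumes normalized text)."""
    return min_chars <= len(text) <= max_chars


rng = random.Random(0)  # fixed seed so the generated item is reproducible
instance = make_char_count_instance(rng, ["strawberry", "草莓", "딸기"])
print(instance["prompt"], "->", instance["answer"])
```

Counting with `len` and `str.count` operates on characters rather than on a model's subword tokens, which is the gap this kind of task is meant to probe: the gold answer is trivial to compute in code even when it is hard for a model that only sees token IDs.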