TL;DR
This paper shows how tokenizer fragmentation of dates hampers temporal reasoning in language models, introduces a metric and a benchmark to measure and analyze the problem, and uncovers an emergent date-abstraction mechanism in large models.
Contribution
It introduces the date fragmentation ratio metric, releases DateAugBench for evaluation, and reveals how large language models develop date abstraction capabilities.
Findings
Excessive date fragmentation reduces accuracy on temporal tasks.
Larger models develop date abstraction faster.
Models follow a different reasoning path than humans for date assembly.
Abstract
Modern BPE tokenizers often split calendar dates into meaningless fragments, e.g., 20250312 → 202, 503, 12, inflating token counts and obscuring the inherent structure needed for robust temporal reasoning. In this work, we (1) introduce a simple yet interpretable metric, termed date fragmentation ratio, that measures how faithfully a tokenizer preserves multi-digit date components; (2) release DateAugBench, a suite of 6500 examples spanning three temporal reasoning tasks: context-based date resolution, format-invariance puzzles, and date arithmetic across historical, contemporary, and future time periods; and (3) through layer-wise probing and causal attention-hop analyses, uncover an emergent date-abstraction mechanism whereby large language models stitch together the fragments of month, day, and year components for temporal reasoning. Our experiments show that excessive…
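To make the idea of "preserving multi-digit date components" concrete, here is a minimal sketch in Python. It is not the paper's definition of the date fragmentation ratio, which this summary does not give. It uses a simplified stand-in that assumes a `YYYYMMDD` string and counts the fraction of components (year, month, day) that do not survive as exactly one token. The function names and the scoring rule are hypothetical.

```python
def spans_from_lengths(lengths):
    """Turn a sequence of piece lengths into (start, end) character spans."""
    spans, start = [], 0
    for length in lengths:
        spans.append((start, start + length))
        start += length
    return spans


def simple_fragmentation_ratio(date_str, tokens, widths=(4, 2, 2)):
    """Fraction of date components NOT preserved as a single token.

    Hypothetical, simplified metric (an assumption, not the paper's formula):
    a component counts as preserved only if some token covers exactly its
    character span in the YYYYMMDD string.
    """
    if "".join(tokens) != date_str:
        raise ValueError("tokens must concatenate back to date_str")
    if sum(widths) != len(date_str):
        raise ValueError("component widths must cover the whole date string")

    token_spans = set(spans_from_lengths(len(t) for t in tokens))
    component_spans = spans_from_lengths(widths)
    broken = sum(1 for span in component_spans if span not in token_spans)
    return broken / len(component_spans)


# The split from the abstract: only the day "12" survives intact.
fragmented = simple_fragmentation_ratio("20250312", ["202", "503", "12"])

# An ideal split that keeps year, month, and day whole.
aligned = simple_fragmentation_ratio("20250312", ["2025", "03", "12"])
```

Under this simplified rule, the abstract's example split breaks two of the three components, while a split aligned to year, month, and day breaks none. A real evaluation would take the token list from the tokenizer under study rather than from hand-written lists.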
Taxonomy
Methods: Fragmentation · Byte Pair Encoding
