A Triadic Suffix Tokenization Scheme for Numerical Reasoning
Olga Chetverina

TL;DR
The paper introduces Triadic Suffix Tokenization (TST), a deterministic method for tokenizing numbers that preserves structure and improves numerical reasoning in language models.
Contribution
TST defines a fixed, one-to-one mapping between suffixes and orders of magnitude, making numerical tokenization more consistent and interpretable for language models.
Findings
TST preserves exact digits and order-of-magnitude relationships.
Two implementation variants cover a wide range of magnitudes and precisions.
The framework is scalable and architecture-agnostic, so it can be integrated into existing language models.
Abstract
Standard subword tokenization methods fragment numbers inconsistently, causing large language models (LLMs) to lose positional and decimal structure, a primary driver of errors in arithmetic and scientific reasoning. We introduce Triadic Suffix Tokenization (TST), a deterministic scheme that partitions digits into three-digit triads and annotates each triad with an explicit magnitude marker. Critically, the scheme defines a fixed, one-to-one mapping between suffixes and orders of magnitude for the integer part (thousands, millions, billions, etc.) and a parallel system of replicated markers for fractional depth (tenths, thousandths, millionths, etc.). Unlike approaches that rely on positional inference, this method provides a consistent gradient signal, which should ensure stable convergence. Two implementation variants are proposed: (1) a vocabulary-based approach that adds at most…
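The triad-and-marker idea in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation: the marker strings (`u`, `k`, `m`, `b`, `t` for integer triads, and a repeated `f` for fractional depth), the function names `tst_encode` and `tst_decode`, and the reading of "replicated markers" as one `f` per fractional triad are all assumptions made for illustration. The sketch handles only unsigned decimal strings.

```python
# Hypothetical markers, NOT taken from the paper: one fixed suffix per
# integer triad (units, thousands, millions, billions, trillions).
INT_MARKERS = ["u", "k", "m", "b", "t"]
# Hypothetical fractional marker, repeated once per fractional triad depth.
FRAC_MARKER = "f"


def tst_encode(number: str) -> list[str]:
    """Split an unsigned decimal string into triad tokens with magnitude suffixes."""
    int_part, _, frac_part = number.partition(".")
    int_part = int_part or "0"
    for part in (int_part, frac_part):
        if part and not (part.isascii() and part.isdigit()):
            raise ValueError(f"not an unsigned decimal string: {number!r}")

    # Integer part: cut triads from the right so each triad aligns with a power of 1000.
    triads = []
    while int_part:
        triads.append(int_part[-3:])
        int_part = int_part[:-3]
    if len(triads) > len(INT_MARKERS):
        raise ValueError("integer part exceeds the markers defined in this sketch")
    tokens = [
        f"{triad}{INT_MARKERS[k]}"
        for k, triad in reversed(list(enumerate(triads)))
    ]

    # Fractional part: cut triads from the left; depth 1 gets "f", depth 2 "ff", ...
    for depth, start in enumerate(range(0, len(frac_part), 3), start=1):
        tokens.append(f"{frac_part[start:start + 3]}{FRAC_MARKER * depth}")
    return tokens


def tst_decode(tokens: list[str]) -> str:
    """Rebuild the decimal string by stripping suffixes; digits are kept verbatim."""
    int_digits, frac_digits = [], []
    for token in tokens:
        digits = token.rstrip("abcdefghijklmnopqrstuvwxyz")
        suffix = token[len(digits):]
        if suffix and set(suffix) == {FRAC_MARKER}:
            frac_digits.append(digits)
        else:
            int_digits.append(digits)
    text = "".join(int_digits)
    return f"{text}.{''.join(frac_digits)}" if frac_digits else text


print(tst_encode("1234567.8912"))  # ['1m', '234k', '567u', '891f', '2ff']
print(tst_decode(tst_encode("1234567.8912")))  # 1234567.8912
```

Because each suffix is tied to exactly one magnitude, the same triad of digits always yields the same token at a given scale, and decoding recovers the original digits exactly, including zero-padded triads such as `000u` in `1000`.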
