Effective Context in Transformers: An Analysis of Fragmentation and Tokenization
Amirmehdi Jafari Fesharaki, Mohammadamin Rami, Aslan Tchamkerten

TL;DR
This paper analyzes how different sequence representations, such as bytes, characters, and subword tokens, affect what a Transformer can predict within a fixed context window. It shows that splitting symbols into smaller units can raise the best achievable loss, while grouping them into larger units can extend the span the window covers.
Contribution
It introduces an information-theoretic framework explaining how fragmentation and tokenization impact finite-context prediction in Transformers, supported by theoretical proofs and diagnostics.
Findings
Fragmentation can increase the optimal finite-context log-loss, indicating intrinsic representation limitations.
Tokenization into larger units can effectively extend the context window's span, improving prediction.
The paper proposes a diagnostic measure for evaluating how well a tokenizer captures source context within a fixed window.
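The second finding can be illustrated with a small exact calculation. The sketch below is a toy example, not the paper's construction: it assumes a hypothetical second-order binary Markov source (the table `Q` is made up) and a hypothetical fixed-length tokenizer that groups bits into non-overlapping pairs. Both predictors get a window of one unit, and the optimal log-loss is computed as a conditional entropy under the stationary distribution.

```python
import math

# Hypothetical second-order binary source (values chosen for illustration only):
# Q[(a, b)] = P(next bit = 1 | previous two bits were a, b).
Q = {(0, 0): 0.9, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.8}

def p_next(a, b, c):
    """P(X_t = c | X_{t-2} = a, X_{t-1} = b)."""
    return Q[(a, b)] if c == 1 else 1.0 - Q[(a, b)]

def stationary_pairs(iters=5000):
    """Stationary distribution over consecutive bit pairs, by power iteration."""
    pi = {(a, b): 0.25 for a in (0, 1) for b in (0, 1)}
    for _ in range(iters):
        new = {s: 0.0 for s in pi}
        for (a, b), p in pi.items():
            for c in (0, 1):
                new[(b, c)] += p * p_next(a, b, c)
        pi = new
    return pi

def cond_entropy(joint):
    """H(Y | C) in bits, given a dict {(context, y): probability}."""
    marg = {}
    for (c, _), p in joint.items():
        marg[c] = marg.get(c, 0.0) + p
    return -sum(p * math.log2(p / marg[c]) for (c, _), p in joint.items() if p > 0)

def per_bit_losses():
    """Optimal log-loss per source bit for each representation, window = 1 unit."""
    pi = stationary_pairs()
    bit_joint, tok_joint = {}, {}
    for (x1, x2), p in pi.items():
        for x3 in (0, 1):
            p3 = p * p_next(x1, x2, x3)
            # Bit-level predictor: only x2 is in the window when predicting x3.
            bit_joint[(x2, x3)] = bit_joint.get((x2, x3), 0.0) + p3
            for x4 in (0, 1):
                # Token-level predictor: the pair (x1, x2) is in the window
                # when predicting the next pair (x3, x4).
                tok_joint[((x1, x2), (x3, x4))] = p3 * p_next(x2, x3, x4)
    bit_loss = cond_entropy(bit_joint)
    tok_loss = cond_entropy(tok_joint) / 2.0  # each token carries two source bits
    return bit_loss, tok_loss

bit_loss, tok_loss = per_bit_losses()
print(f"bits, window of 1 unit:   {bit_loss:.4f} bits per source bit")
print(f"tokens, window of 1 unit: {tok_loss:.4f} bits per source bit")
```

With the same window length counted in units, the pair-token predictor sees two source bits of history, which is all this source depends on, so its per-bit loss is lower than that of the bit-level predictor.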
Abstract
Transformers predict over a representation of a sequence. The same data can be written as bytes, characters, or subword tokens, and these representations may be lossless. Yet, under a fixed context window, they need not expose the same information to the model. This raises a basic question: how does the choice of representation change what a finite-context predictor can achieve? We study this question on Markov sources and uncover two complementary phenomena. First, we observe that moving to smaller representation units can hurt prediction even when the context window is enlarged to cover the relevant source history. To explain this, we introduce fragmentation: a lossless recoding that replaces each source symbol by several smaller units. We prove that fragmentation can strictly increase the optimal finite-context log-loss, showing that the gap is not merely an optimization or…
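The fragmentation claim can be checked numerically on a toy case. The sketch below rests on several assumptions that are not taken from the paper: a hypothetical 4-symbol first-order Markov source (the matrix `P` is made up), a recoding of each symbol into its two bits, a window of two bits, and a predictor that knows which bit position it is predicting. It only shows the simplest mechanism, in which a window as long in bits as one source symbol slides across symbol boundaries; the paper's own construction and proofs are not reproduced here.

```python
import math

# Hypothetical 4-symbol first-order Markov source (values chosen for illustration):
# P[i][j] = P(next symbol = j | current symbol = i).
P = [
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.10, 0.10, 0.70],
    [0.10, 0.70, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
]

def stationary(P, iters=5000):
    """Stationary distribution of the chain, by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

def cond_entropy(joint):
    """H(Y | C) in bits, given a dict {(context, y): probability}."""
    marg = {}
    for (c, _), p in joint.items():
        marg[c] = marg.get(c, 0.0) + p
    return -sum(p * math.log2(p / marg[c]) for (c, _), p in joint.items() if p > 0)

def per_symbol_losses(P):
    """Optimal log-loss per source symbol: symbol view vs. 2-bit fragments."""
    pi = stationary(P)
    sym_joint, b1_joint, b2_joint = {}, {}, {}
    for i in range(4):
        for j in range(4):
            p = pi[i] * P[i][j]
            hi, lo = j >> 1, j & 1  # fragment symbol j into (high bit, low bit)
            # Symbol view, window of 1 symbol: context is the previous symbol i.
            sym_joint[(i, j)] = p
            # High bit of j: the 2-bit window holds both bits of i.
            b1_joint[(i, hi)] = b1_joint.get((i, hi), 0.0) + p
            # Low bit of j: the window now holds (low bit of i, high bit of j);
            # the high bit of i has slid out of the window.
            ctx = (i & 1, hi)
            b2_joint[(ctx, lo)] = b2_joint.get((ctx, lo), 0.0) + p
    source_loss = cond_entropy(sym_joint)
    frag_loss = cond_entropy(b1_joint) + cond_entropy(b2_joint)
    return source_loss, frag_loss

source_loss, frag_loss = per_symbol_losses(P)
print(f"symbols, window of 1 symbol: {source_loss:.4f} bits per source symbol")
print(f"2-bit fragments, window of 2 bits: {frag_loss:.4f} bits per source symbol")
```

The fragmented loss can never be lower than the source loss here, because conditioning on part of the previous symbol gives no more information than conditioning on all of it. For this particular `P` the gap is strict, since the low bit of the next symbol depends on the high bit of the previous one.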
