Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models

Lin Zheng; Vasilisa Bashlovkina; Timothy Dozat; Dan Garrette; Laura Rimell; Joshua Maynez

arXiv:2605.09630 · cs.CL · May 12, 2026

TL;DR

Scratchpad Patching improves patch-based byte-level language models by refreshing patch-level context inside each patch, cutting compute and KV-cache needs while maintaining or improving model quality.

Contribution

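The paper introduces Scratchpad Patching (SP), a method that decouples compute from patch size by inserting transient scratchpads inside each patch. Each scratchpad aggregates the bytes seen so far and refreshes the patch-level context used for subsequent predictions.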

Findings

01

SP matches baseline quality at 16-byte patches with less compute.

02

SP reduces the KV cache by a factor of 16 and inference compute by a factor of 3 to 4.

03

SP improves natural language and code modeling performance.
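
One way to read the factor-of-16 cache figure is as simple counting. The sketch below assumes, hypothetically, that the comparison is between caching one key/value entry per byte and caching one entry per 16-byte patch, and that scratchpads, being transient, are not kept in the cache. The summary does not spell out the baseline, so `kv_cache_entries` and the sequence length are illustrative only.

```python
def kv_cache_entries(num_bytes: int, patch_size: int) -> int:
    """Number of cached entries if one key/value entry is kept per patch.

    Assumption (not from the paper): scratchpads are transient and add
    nothing to the cache, so only whole patches are counted.
    """
    # Ceiling division: a trailing partial patch still needs one entry.
    return -(-num_bytes // patch_size)


sequence_length = 4096  # hypothetical sequence length in bytes
per_byte = kv_cache_entries(sequence_length, patch_size=1)
per_patch = kv_cache_entries(sequence_length, patch_size=16)
reduction = per_byte / per_patch
print(per_byte, per_patch, reduction)
```

Under these assumptions the reduction equals the patch size, which is consistent with the 16-byte setting reported above.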

Abstract

Tokenizer-free language models eliminate the tokenizer step of the language modeling pipeline by operating directly on bytes; patch-based variants further aggregate contiguous byte spans into patches for efficiency. However, the average patch size chosen at the model design stage governs a tight trade-off: larger patches reduce compute and KV-cache footprint, but degrade modeling quality. We trace this trade-off to patch lag: until a patch is fully observed, byte predictions within it must rely on a stale representation from the previous patch to preserve causality; this lag widens as patches grow larger. We introduce Scratchpad Patching (SP), which inserts transient scratchpads inside each patch to aggregate the bytes seen so far and refresh patch-level context for subsequent predictions. SP triggers scratchpads using next-byte prediction entropy, selectively allocating compute to…
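
The patch-lag argument can be illustrated with a toy calculation. The sketch below is not the paper's implementation: it measures lag as the number of bytes seen since the last patch-level context refresh, and it assumes a simple threshold rule on next-byte entropy to decide where scratchpads go. The abstract is truncated before the trigger is described in detail, so `trigger_positions`, the threshold value, and the entropy numbers are hypothetical.

```python
import math
from typing import List, Sequence


def entropy_bits(probs: Sequence[float]) -> float:
    """Shannon entropy, in bits, of a next-byte probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def patch_lags(num_bytes: int, patch_size: int,
               scratchpads: Sequence[int] = ()) -> List[int]:
    """Bytes seen since the last context refresh, for each byte position.

    Context is refreshed at every patch boundary and, in this toy model,
    at every position listed in `scratchpads`.
    """
    refresh = set(scratchpads)
    lags, lag = [], 0
    for i in range(num_bytes):
        if i % patch_size == 0 or i in refresh:
            lag = 0
        lags.append(lag)
        lag += 1
    return lags


def trigger_positions(entropies: Sequence[float], threshold: float) -> List[int]:
    """Hypothetical trigger: place a scratchpad where entropy exceeds threshold."""
    return [i for i, h in enumerate(entropies) if h > threshold]


# Lag grows with patch size: mean lag is (patch_size - 1) / 2 for fixed patches.
mean_lag_4 = sum(patch_lags(16, 4)) / 16
mean_lag_16 = sum(patch_lags(16, 16)) / 16

# Made-up per-byte entropies for one 8-byte patch; scratchpads go at the peaks.
entropies = [0.2, 0.1, 3.1, 0.4, 0.3, 2.8, 0.2, 0.1]
pads = trigger_positions(entropies, threshold=2.0)
with_pads = patch_lags(8, 8, scratchpads=pads)
print(mean_lag_4, mean_lag_16, pads, with_pads)
```

In this toy model, a scratchpad resets the lag only at the positions where the model is most uncertain, which mirrors the idea of selectively allocating compute rather than shrinking every patch.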
