Transducing Language Models
Vésteinn Snæbjarnarson, Samuel Kiegeland, Tianyu Liu, Reda Boumasmoud, Ryan Cotterell, Tim Vieira

TL;DR
This paper introduces a formal framework for transforming language models via deterministic string-to-string functions, enabling adaptation to different output formats without retraining.
Contribution
It formalizes the use of finite-state transducers for transforming language models and develops algorithms for composition, marginalization, and conditioning on transformed outputs.
Findings
Effective conversion of language models from tokens to bytes, from tokens to words, and from DNA to amino acids.
Algorithms enable inference-time adaptation without retraining.
Demonstrated improvements in matching application-specific output formats.
Abstract
Modern language models define distributions over strings, but downstream tasks often require different output formats. For instance, a model that generates byte-pair strings does not directly produce word-level predictions, and a DNA model does not directly produce amino-acid sequences. In such cases, a deterministic string-to-string transformation can convert the model's output to the desired form. This is a familiar pattern in probability theory: applying a function to a random variable yields a transformed random variable with an induced distribution. While such transformations are occasionally used in language modeling, prior work does not treat them as yielding new, fully functional language models. We formalize this perspective and introduce a general framework for language models derived from deterministic string-to-string transformations. We focus on…
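The induced-distribution idea in the abstract can be illustrated with a minimal sketch. It assumes a toy "language model" given as an explicit finite table of string probabilities, which is not how the paper represents models. The names `toy_dna_lm`, `translate`, and `pushforward` are hypothetical and chosen only for illustration. The codon table is a small excerpt of the standard genetic code. The sketch applies a deterministic string-to-string function to each source string and sums the probability of all source strings that map to the same output:

```python
from collections import defaultdict

# Hypothetical toy "language model": an explicit distribution over DNA strings.
# A real model would define probabilities over infinitely many strings.
toy_dna_lm = {
    "ATGGCT": 0.4,
    "ATGGCC": 0.3,
    "ATGTGG": 0.2,
    "ATG": 0.1,
}

# Excerpt of the standard genetic code (only the codons used above).
CODON_TO_AMINO = {"ATG": "M", "GCT": "A", "GCC": "A", "TGG": "W"}


def translate(dna: str) -> str:
    """Deterministic string-to-string map: read codons left to right."""
    return "".join(CODON_TO_AMINO[dna[i:i + 3]] for i in range(0, len(dna), 3))


def pushforward(dist: dict, f) -> dict:
    """Induced distribution of f(X) when X is distributed according to dist."""
    out = defaultdict(float)
    for x, p in dist.items():
        out[f(x)] += p  # many source strings may share one output string
    return dict(out)


amino_lm = pushforward(toy_dna_lm, translate)
for y, p in sorted(amino_lm.items()):
    print(y, round(p, 3))
```

Here "ATGGCT" and "ATGGCC" both translate to "MA", so their probabilities are added together in the transformed distribution. Brute-force enumeration like this only works for a finite table; the paper's transducer-based algorithms address the general case.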
Peer Reviews
Decision·ICLR 2026 Poster
S1. I think this is a really interesting problem, and it's both well-explained and well-explored in this work. I can immediately see how this would be useful. S2. I really like the framing of this with FSTs; it's an intuitive way to think about the class of transformations and a really nice formalism. The walkthrough of the math is detailed and cleanly developed.
W1. While overall I think the explanation is good, there are several parts of the paper that I think require context from "From Language Models over Tokens to Language Models over Characters" to fully understand. The two most obvious ones to me: (1) the discussion in lines 67-69 of applications to computational psycholinguistics and controlled generation; (2) the relation between transducing and token healing. W2. The work is largely a generalization of a prior (also cool!) result. I think thi
The idea is creative, well motivated, theoretically interesting, and practically useful. This is a classic application of algorithms, probability, and formal language theory. The examples in the appendix are informative and give a good intuition for how the method works.
I am quite satisfied with the presentation and ideas in this paper and do not see any significant reasons why it should not be accepted.
Clear, principled generalization of the strict-monotone case with an identity that is easy to implement (quotient + remainder). Practical algorithms with sensible speedups and demonstrated accuracy–speed trade-offs on non-trivial mappings. Overall the definitions, explanations, and conditions are crisp and easy to follow.
Key algorithms and experimental details are mostly in the appendix, and the approximations in 5.2 are described very briefly. Readers may miss core contributions if they don’t consult the appendix. Just a suggestion, feel free to ignore if it doesn't make sense: bring a minimal algorithm box/flowchart and a small table that lists each approximation knob (lazy determinization, memoization, pruning, caps) with default settings into the main text. The paper varies $\tau$ but does not isolate the
Taxonomy
Topics: Machine Learning and Algorithms · Natural Language Processing Techniques · Data Quality and Management
