Discrete Adjoint Matching

Oswin So; Brian Karrer; Chuchu Fan; Ricky T. Q. Chen; Guan-Horng Liu

arXiv:2602.07132·stat.ML·February 17, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces Discrete Adjoint Matching (DAM), a novel method for fine-tuning discrete generative models like diffusion-based language models, by adapting continuous adjoint matching techniques to discrete settings.

Contribution

DAM is the first discrete variant of Adjoint Matching, enabling effective fine-tuning of discrete models using a new statistical estimator derived from the original continuous framework.

Findings

01. DAM outperforms baseline methods on synthetic tasks

02. DAM effectively handles discrete state spaces in language models

03. The approach opens new avenues for adjoint-based estimators in discrete domains

Abstract

Computational methods for solving entropy-regularized reward optimization -- a class of problems widely used for fine-tuning generative models -- have advanced rapidly. Among those, Adjoint Matching (AM; Domingo-Enrich et al., 2025) has proven highly effective in continuous state spaces with differentiable rewards. Transferring these practical successes to discrete generative modeling, however, remains particularly challenging and largely unexplored, mainly due to the drastic shift in generative model classes to discrete state spaces, which are nowhere differentiable. In this work, we propose Discrete Adjoint Matching (DAM) -- a discrete variant of AM for fine-tuning discrete generative models characterized by Continuous-Time Markov Chains, such as diffusion-based large language models. The core of DAM is the introduction of the discrete adjoint -- an estimator of the optimal solution to the…
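As a rough illustration of the entropy-regularized reward optimization problem mentioned in the abstract, the sketch below works on a toy three-state space and assumes the common form of the regularizer, a KL penalty toward the base model. It is not the paper's CTMC setting or the DAM estimator; the names `kl`, `objective`, and `tilted`, and the numbers in `base` and `reward`, are hypothetical. Under that assumption the maximizer has the well-known closed form p*(x) ∝ p_base(x) · exp(r(x)), and the optimal objective value equals the log normalizer.

```python
import math


def kl(p, q):
    """KL(p || q) for two distributions over the same finite set."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def objective(p, base, reward):
    """Expected reward under p minus KL(p || base) (assumed regularizer)."""
    expected_reward = sum(pi * ri for pi, ri in zip(p, reward))
    return expected_reward - kl(p, base)


def tilted(base, reward):
    """Closed-form maximizer: p*(x) proportional to base(x) * exp(reward(x))."""
    weights = [bi * math.exp(ri) for bi, ri in zip(base, reward)]
    z = sum(weights)
    return [w / z for w in weights]


# Toy base model and reward over three discrete states (illustrative values).
base = [0.5, 0.3, 0.2]
reward = [0.0, 1.0, 2.0]

p_star = tilted(base, reward)
log_z = math.log(sum(b * math.exp(r) for b, r in zip(base, reward)))

print("tilted distribution:", p_star)
print("objective at p*:", objective(p_star, base, reward))
print("log normalizer:", log_z)
print("objective at base:", objective(base, base, reward))
```

On a space this small the normalizer can be summed exactly. For sequence models the state space is far too large for that, which is why fine-tuning methods rely on sampling-based estimators instead.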

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

- The authors provide a deep theoretical analysis to justify their discrete version of AM. They use fixed-point equations to prove that their practical algorithm is guaranteed to converge to the true optimal solution.
- The authors address a computationally intractable term in their theoretically optimal solution. They then methodically build a practical solution: estimation via sampling, approximating the correction factor by sampling a few possible futures (K

Weaknesses

- The algorithm requires $K$ model forward passes per training step to build its estimator. While this is clearly effective on an 8B model, the cost of fine-tuning much larger models (e.g., 70B+) is not discussed. A small experiment reporting training time vs. final accuracy for DAM and D1 would make the paper's practical claims much stronger.
- A valuable addition to the empirical analysis would be an ablation study on the number of samples $K$ used in the importance-weighted estimator.

Reviewer 02 · Rating 8 · Confidence 1

Strengths

1. The motivation is clear and significant, addressing the need for reward-guided fine-tuning of discrete diffusion-based models.
2. The theory seems to be sound.

Weaknesses

This seems to be quite a good paper, but I am not a theory expert, so I will be alert to any issues raised by other reviewers. I also have a question about the performance of LLaDA-8B on GSM8K. According to [A], the base LLaDA model scores 80+ on GSM8K, but in your paper the performance is 60-70. Could you please explain this gap?

Reference: [A] Revolutionizing Reinforcement Learning Framework for Diffusion Large

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. **Clear conceptual motivation:** The paper addresses a timely and well-motivated gap: extending adjoint-based optimization methods, previously limited to continuous diffusion models, to the discrete generative setting, which is crucial for language and symbolic models.
2. **Principled extension of Adjoint Matching:** DAM is a nontrivial discrete analogue of Adjoint Matching (AM), retaining its optimization-by-simulation philosophy while adapting it to the constraints of discrete-time, discre

Weaknesses

1. **Clarity and depth of the theoretical exposition:** The theoretical development is solid and well-motivated, but occasionally dense. Some key derivations, particularly the transition from Dynkin's formulation to the discrete adjoint system, could be presented with more intuition and interpretive discussion, to help the reader understand the underlying mechanics beyond the formal algebra.
2. **Limited discussion of importance sampling techniques:** The paper briefly introduces importance weight

Taxonomy

Topics: Reinforcement Learning in Robotics · Artificial Intelligence in Games · Formal Methods in Verification