Minibatch Optimal Transport and Perplexity Bound Estimation in Discrete Flow Matching
Etrit Haxholli, Yeti Z. Gurbuz, Ogul Can, Eli Waxman

TL;DR
This paper introduces a discrete flow matching framework based on minibatch optimal transport that reduces the number of state transitions by up to 32 times, proposes bounds for perplexity estimation, and introduces Multimask Flows, which outperform masked flows.
Contribution
The paper develops a dynamic-optimal-transport-like minimization objective for discrete flows, derives its Kantorovich formulation, and introduces Multimask Flows for improved generative performance.
Findings
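Minibatch optimal transport reduces the number of state transitions by up to 32 times (1024 to 32) at the same generative perplexity.
The proposed perplexity bounds enable model evaluation even though exact probabilities cannot be computed for discrete flows.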
Multimask Flows outperform masked flows in perplexity without losing diversity.
Abstract
Discrete flow matching, a recent framework for modeling categorical data, has shown competitive performance with autoregressive models. However, unlike continuous flow matching, the rectification strategy cannot be applied due to the stochasticity of discrete paths, necessitating alternative methods to minimize state transitions. We propose a dynamic-optimal-transport-like minimization objective and derive its Kantorovich formulation for discrete flows with convex interpolants, where transport cost depends solely on inter-state similarity and can be optimized via minibatch strategies. We show that such methods can reduce the number of transitions up to 32 times (1024 to 32) to reach the same generative perplexity without compromising diversity. Additionally, path nondeterminism in discrete flows precludes an instantaneous change-of-variables analogue, preventing precise probability…
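The minibatch strategy mentioned in the abstract can be illustrated with a minimal sketch. It assumes the transport cost between two sequences is the Hamming distance (one of the costs the reviews say the formulation recovers) and that the coupling is computed exactly on each batch with SciPy's `linear_sum_assignment`. The function names `hamming_cost` and `minibatch_ot_pairing`, the batch sizes, and the random stand-in data are all hypothetical; the paper's actual solver and training loop may differ.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def hamming_cost(x0, x1):
    """Pairwise Hamming distances between rows of x0 and x1, both shaped (B, L)."""
    # Broadcast to (B, B, L), then count mismatching positions for each pair.
    return (x0[:, None, :] != x1[None, :, :]).sum(axis=-1)


def minibatch_ot_pairing(x0, x1):
    """Re-pair source rows x0 with target rows x1 to minimize total Hamming cost."""
    cost = hamming_cost(x0, x1)
    # Exact one-to-one assignment on the B x B cost matrix.
    rows, cols = linear_sum_assignment(cost)
    return x0[rows], x1[cols], int(cost[rows, cols].sum())


rng = np.random.default_rng(0)
vocab_size, batch, length = 50, 8, 16  # illustrative sizes only (assumption)
x0 = rng.integers(0, vocab_size, size=(batch, length))  # source samples
x1 = rng.integers(0, vocab_size, size=(batch, length))  # stand-in for data samples

_, _, ot_cost = minibatch_ot_pairing(x0, x1)
independent_cost = int((x0 != x1).sum())  # cost of the original index-wise pairing
print(ot_cost, independent_cost)
```

Because the index-wise pairing is itself a feasible assignment, the re-paired cost can never exceed it. Fewer mismatching positions between paired source and target sequences means fewer tokens have to change along the path, which is the intuition behind the reduced number of transitions.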
Peer Reviews
Decision: Submitted to ICLR 2026
1. The paper is very well written, and the mathematical theorems are clear and well presented. 2. The paper formulates dynamic OT for discrete state spaces and proves its relation to the Kantorovich formulation, which allows training with minibatch OT. 3. Table 3 shows new and interesting results for minibatch OT on generative perplexity.
1. In subsection 5.2, if the only difference between Table 2 and the table presented in [1] is the model trained with (1-t) time weighting, then the novelty relative to previous work is low. 2. As reported in the appendix, while the multimask source improves generative perplexity with minibatch OT, it hurts perplexity both with and without minibatch OT (Table 8). 3. The authors state that computing minibatch OT on a batch of 1000 samples adds approximately 3.
- The provided proofs seem mostly fine. - The connections between the Hamming distance and the metric are interesting. - The connection between the discrete dynamic problem and the Kantorovich formulation seems novel. - The authors introduce Multimask Flow, where the vocabulary now also contains multiple mask tokens.
- Poor typesetting overall. Examples include: line 127 (should not be numbered, nor in align/gather environments), lines 159-161 (unindented), all tables could benefit from using booktabs, and proofs in the appendix have lines ending with an equality sign, which should really be at the beginning of the next line. - The contribution in Section 3 seems weak. I must admit that I am not even certain what exactly is proved. It does seem that it introduces the dynamic formulation for the discrete optimal transpor
- The categorical Benamou–Brenier-style equivalence (Theorem 3.1) is interesting. The dynamic jump-minimization equals a Kantorovich problem with cost $c(x_0,x_1)=\sum_i s(x_0^i,x_1^i)$, recovering Hamming distance and L2-embedding costs for different choices of $s$. - On OpenWebText with GPT-2–sized models, minibatch-OT cuts steps by ~8× to match the non-OT model’s generative perplexity. Training overhead is reported at ~3.4% for L=128, with favorable scaling to longer sequences.
- The paper uses generative perplexity judged with external LMs, so the choice of judge LM could introduce bias. - The headline gains are reported on the judged metric rather than on the bounds themselves. - Most results are GPT-2-scale on BoW settings. It is unclear how the benefits translate to larger LMs or real web text. The method likely generalizes, but evidence is limited.
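To make the judged metric concrete, here is a minimal sketch of how a generative perplexity is typically computed, assuming the standard definition: the exponential of the mean negative log-likelihood that a judge model assigns to the generated tokens. The function name `perplexity` and the input format (per-token natural-log probabilities already produced by some external LM) are assumptions for illustration, not the paper's evaluation code.

```python
import math


def perplexity(token_logprobs):
    """exp(mean negative log-likelihood) over all tokens of all sequences.

    token_logprobs: list of sequences, each a list of natural-log
    probabilities assigned by a judge model (assumed to be given).
    """
    flat = [lp for seq in token_logprobs for lp in seq]
    if not flat:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(flat) / len(flat))


# A judge that assigns probability 0.5 to every token yields perplexity of about 2.
sample = [[math.log(0.5)] * 4, [math.log(0.5)] * 2]
print(perplexity(sample))
```

Since the log-probabilities come from the judge model, the same generated text can receive different scores under different judges, which is the concern raised in the review above.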
Taxonomy
Topics: Optimization and Search Problems · Data Stream Mining Techniques · Traffic Prediction and Management Techniques
Methods: Diffusion
