A Theoretical Analysis of Why Masked Diffusion Models Mitigate the Reversal Curse

Moongyu Jeon; Sangwoo Shin; BumJun Kim; Kyelim Lee; Albert No

arXiv:2602.02133 · cs.AI · May 13, 2026


TL;DR

This paper gives a theoretical explanation for why masked diffusion language models mitigate the reversal curse more effectively than autoregressive models, focusing on parameter-level coupling and shared evidence storage.

Contribution

The paper presents a theoretical analysis showing how shared Transformer parameters enable evidence transfer between forward and reverse queries in masked diffusion models.

Findings

01 Shared parameters store token-pair evidence, facilitating transfer.

02 Forward training strengthens reversible evidence and aligns attention routes.

03 Experiments confirm improved reverse prediction in masked diffusion models.

Abstract

Autoregressive language models (ARMs) suffer from the reversal curse: after learning "$A$ is $B$," they often fail on the reverse query "$B$ is $A$." Masked diffusion language models (MDMs) exhibit this failure in a much weaker form, but the underlying reason has remained unclear. A common explanation attributes this mitigation to their any-order masked training objective. However, observing "$[M]$ is $B$" during training teaches recovery of $A$ from $B$ in one positional configuration, and does not by itself explain why the learned evidence should transfer to the reverse prompt "$B$ is $[M]$." We provide a theoretical analysis showing that this transfer arises from a parameter-level coupling between forward and reverse positional conditionals: shared Transformer parameters store token-pair evidence, while relative positional encodings route attention through…
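
The abstract's point about positional configurations can be illustrated with a toy enumeration. The sketch below is not the paper's code: the helper `masked_views` and the three-token fact are hypothetical, chosen only to show that any-order masking of the forward fact "A is B" produces the input "[M] is B" but never the reverse prompt "B is [M]", so the masking objective alone does not supply the reverse configuration.

```python
from itertools import combinations

MASK = "[M]"

def masked_views(tokens):
    """Return every non-empty masking of `tokens` as (masked_input, targets) pairs.

    Hypothetical helper for illustration; a real masked diffusion model samples
    mask patterns at random rather than enumerating them.
    """
    views = []
    n = len(tokens)
    for k in range(1, n + 1):
        for positions in combinations(range(n), k):
            masked = tuple(MASK if i in positions else t for i, t in enumerate(tokens))
            targets = {i: tokens[i] for i in positions}
            views.append((masked, targets))
    return views

# Toy forward fact "A is B" (assumed tokenization: one token per word).
forward_fact = ["A", "is", "B"]
views = masked_views(forward_fact)
inputs = {masked for masked, _ in views}

seen_in_training = (MASK, "is", "B")   # recover A from B, with A in position 0
reverse_prompt = ("B", "is", MASK)     # reverse query, with B in position 0

print(seen_in_training in inputs)  # the forward fact yields this masked input
print(reverse_prompt in inputs)    # the reverse prompt is never generated
```

In this toy setting the reverse prompt places $B$ and the mask in positions that no masked view of the forward fact covers, which is why the abstract argues that some further mechanism (parameter-level coupling) is needed to explain the transfer.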
