A Theoretical Analysis of Why Masked Diffusion Models Mitigate the Reversal Curse
Moongyu Jeon, Sangwoo Shin, BumJun Kim, Kyelim Lee, Albert No

TL;DR
This paper gives a theoretical explanation for why masked diffusion language models mitigate the reversal curse better than autoregressive models, centered on a parameter-level coupling between forward and reverse queries and on evidence stored in shared parameters.
Contribution
It introduces a theoretical analysis demonstrating how shared Transformer parameters enable evidence transfer between forward and reverse queries in masked diffusion models.
Findings
Shared Transformer parameters store token-pair evidence, which allows evidence learned from forward statements to transfer to reverse queries.
Forward training strengthens reversible evidence and aligns attention routes.
Experiments confirm improved reverse prediction in masked diffusion models.
Abstract
Autoregressive language models (ARMs) suffer from the reversal curse: after learning "A is B," they often fail on the reverse query "B is A." Masked diffusion language models (MDMs) exhibit this failure in a much weaker form, but the underlying reason has remained unclear. A common explanation attributes this mitigation to their any-order masked training objective. However, observing the forward statement with one entity masked during training teaches recovery of that entity from the other in one positional configuration, and does not by itself explain why the learned evidence should transfer to the reverse prompt. We provide a theoretical analysis showing that this transfer arises from a parameter-level coupling between forward and reverse positional conditionals: shared Transformer parameters store token-pair evidence, while relative positional encodings route attention through…
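The contrast between the two training objectives can be made concrete with a toy enumeration. The sketch below is not the paper's analysis or code. It only lists the prediction targets each objective produces for the three-token statement "A is B". The function names, the [MASK] placeholder string, and the choice to enumerate every non-empty mask set are all assumptions made for illustration.

```python
from itertools import combinations

MASK = "[MASK]"  # hypothetical placeholder for a masked position


def ar_examples(tokens):
    """(context, target) pairs for left-to-right next-token prediction."""
    return [(tuple(tokens[:i]), tokens[i]) for i in range(1, len(tokens))]


def masked_examples(tokens):
    """(masked sequence, {position: target}) for every non-empty mask set."""
    n = len(tokens)
    examples = []
    for k in range(1, n + 1):
        for positions in combinations(range(n), k):
            masked = tuple(MASK if i in positions else t
                           for i, t in enumerate(tokens))
            targets = {i: tokens[i] for i in positions}
            examples.append((masked, targets))
    return examples


def ar_predicts_a_from_b(tokens):
    """True if some AR example has target 'A' with 'B' in its context."""
    return any(target == "A" and "B" in context
               for context, target in ar_examples(tokens))


def masked_predicts_a_from_b(tokens):
    """True if some masked example recovers 'A' while 'B' stays visible."""
    return any("A" in targets.values() and "B" in masked
               for masked, targets in masked_examples(tokens))


forward = ["A", "is", "B"]
print(ar_examples(forward))
print(ar_predicts_a_from_b(forward))
print(masked_predicts_a_from_b(forward))
```

Under these assumptions, the autoregressive objective never asks for A with B in context, while the masked objective does. In every masked example, though, A still sits before B. The reverse prompt puts B first, which is the gap the abstract says the any-order explanation leaves open.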
