The Illusion of Latent Generalization: Bi-directionality and the Reversal Curse
Julian Coda-Forno, Jane X. Wang, Arslan Chaudhry

TL;DR
This paper investigates how different training objectives affect language models' ability to retrieve facts in reverse order. It finds that reversal success requires training signal that makes the source entity an explicit prediction target, and it finds little evidence of a single direction-agnostic representation of a fact.
Contribution
It compares MLM and decoder-only masking objectives across reversal benchmarks and provides mechanistic insights into how these objectives influence fact retrieval.
Findings
Reversal accuracy depends on explicit source prediction signals.
Representation distances suggest forward and reverse facts are stored separately.
Objective-level improvements do not necessarily imply true latent generalization.
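The first finding can be illustrated with a toy sketch of which tokens each objective ever asks the model to predict. This is not the paper's setup: the example sentence, the entity names `ALICE` and `BOB`, and the helper functions are all hypothetical, and the masked objective is simplified to masking one position at a time rather than sampling a random subset. Under these assumptions, next-token prediction never makes the first (source) entity a target conditioned on the later entity, while a masking-based objective does.

```python
from typing import List, Tuple

Pair = Tuple[List[str], str]  # (visible context, token to predict)


def causal_targets(tokens: List[str]) -> List[Pair]:
    """Next-token prediction: each token is predicted from its left context only."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def masked_targets(tokens: List[str], mask_token: str = "[MASK]") -> List[Pair]:
    """Simplified masking objective: mask each position in turn.

    The context keeps the tokens on both sides of the masked position.
    """
    pairs = []
    for i, tok in enumerate(tokens):
        context = tokens[:i] + [mask_token] + tokens[i + 1:]
        pairs.append((context, tok))
    return pairs


def source_predicted_from_target(pairs: List[Pair], source: str, target: str) -> bool:
    """True if some training pair predicts `source` with `target` visible in context."""
    return any(tok == source and target in ctx for ctx, tok in pairs)


# Hypothetical forward fact; ALICE is the source entity, BOB the target entity.
fact = ["ALICE", "is", "the", "mother", "of", "BOB"]

causal = causal_targets(fact)
masked = masked_targets(fact)

print(source_predicted_from_target(causal, "ALICE", "BOB"))  # False
print(source_predicted_from_target(masked, "ALICE", "BOB"))  # True
```

The sketch only counts which supervision signals exist. It says nothing about how the resulting facts are represented internally, which is the question the paper's distance and probing analyses address.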
Abstract
The reversal curse describes a failure of autoregressive language models to retrieve a fact in reverse order (e.g., training on a fact stated in one direction but failing when it is queried in the other). Recent work shows that objectives with bidirectional supervision (e.g., bidirectional attention or masking-based reconstruction for decoder-only models) can mitigate the reversal curse. We extend this evaluation to include a vanilla masked language modeling (MLM) objective and compare it to decoder-only masking-based training across four reversal benchmarks, and then provide a minimal mechanistic study of how these objectives succeed. We show that reversal accuracy requires training signal that explicitly makes the source entity a prediction target, and we find little evidence that success corresponds to a single direction-agnostic representation of a fact. Instead, representation distances and linear probes are…
