TL;DR
This paper demonstrates that a shared-weight recurrent Transformer can develop distinct internal roles through input asymmetry and state dynamics, leading to emergent specialization without explicit partitioning.
Contribution
It introduces the Asymmetric Input Recurrence (AIR) architecture and uses it to show that a single shared Transformer can spontaneously develop specialized internal states when the only built-in asymmetry is where the input is injected.
Findings
Shared model develops distinct proposal and uncertainty states.
Input asymmetry and state dynamics induce specialization.
Attention analysis reveals different localities for update types.
Abstract
Can a shared-weight recurrent Transformer develop distinct internal roles without being partitioned into separate modules? We study this in Asymmetric Input Recurrence (AIR), a minimal two-state reasoning architecture in which the same Transformer model is reused for both updates (called L and H, following the literature), and the only built-in difference in the update rule is that the encoded input is injected during L-updates but not H-updates. Across Sudoku-Extreme and Maze, decoded rollouts reveal a consistent split: one state behaves like a fully committed proposal state, whereas the other retains local uncertainty and shifting intermediate structure. Freeze experiments show that this split is, in practice, related to the model's state dynamics: in Sudoku, the two freezes have opposite effects on content changes (one reduces them, the other increases them), while in Maze, freezing either state increases content changes…
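The update rule described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration rather than the paper's implementation: the state names `z_L` and `z_H`, the additive way the states and input are combined, the `tanh` layer standing in for the shared Transformer, the `freeze` flag, and the argmax-based `content_changes` count are all hypothetical. The only property taken from the abstract is that one shared function performs both updates and the encoded input `x` enters the L-update but not the H-update.

```python
import numpy as np

def make_shared_block(dim, seed=0):
    """Toy stand-in for the shared Transformer: one weight matrix reused everywhere."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=dim ** -0.5, size=(dim, dim))

    def f(h):
        return np.tanh(h @ W)

    return f

def air_rollout(x, f, steps=4, freeze=None):
    """Alternate L- and H-updates using the same function f.

    x      : encoded input, shape (positions, dim)
    freeze : None, "L", or "H" (hypothetical flag); the named state is never updated.
    Returns a list of (z_L, z_H) snapshots, one per step.
    """
    z_L = np.zeros_like(x)
    z_H = np.zeros_like(x)
    history = []
    for _ in range(steps):
        if freeze != "L":
            z_L = f(z_L + z_H + x)  # L-update: encoded input is injected (assumed additive)
        if freeze != "H":
            z_H = f(z_H + z_L)      # H-update: same weights, no input injection
        history.append((z_L.copy(), z_H.copy()))
    return history

def content_changes(states):
    """Count positions whose argmax 'decoded token' differs between consecutive steps.

    A real model would decode through an output head; argmax over the raw state
    is a simplification for this sketch.
    """
    decoded = [s.argmax(axis=-1) for s in states]
    return sum(int((a != b).sum()) for a, b in zip(decoded, decoded[1:]))
```

A freeze-style comparison in this toy setting would call `air_rollout` once with `freeze=None` and once with `freeze="L"` or `freeze="H"`, then compare `content_changes` on each state's snapshots. With random weights the numbers carry no meaning; the sketch only shows where the asymmetry sits in the loop.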
