On the Runway Cascade of Transformers for Language Modeling
Hunjae Lee, Corey Clark

TL;DR
This paper introduces runway-aware rewiring in decoder-only transformers to address information propagation issues, leading to improved language modeling, retrieval, and extrapolation without adding extra parameters.
Contribution
It formalizes the runway cascade phenomenon and proposes a parameter-free rewiring method to enhance information flow in causal transformers.
Findings
Improved language modeling performance
Enhanced information retrieval capabilities
Better extrapolation abilities
Abstract
In decoder-only (causal) transformers, the computation graph created by causal masking routes information through both direct-path attention and indirect paths formed by intermediate tokens. We denote these indirect paths between token pairs as their runways. We argue that certain failure modes of causal transformers as observed by a growing body of recent works are likely exacerbated by a misalignment between these two information propagation modes. We formalize runway cascade as a phenomenon whereby this misalignment results in redundancies and irrelevant information cascading to token representations despite adequately learned attention patterns. As a solution, we propose runway-aware rewiring as a more explicit way of incorporating runway context directly into each token's direct-path attention. This mechanism re-wires the attention pattern for each token based on a summary of its…
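To make the distinction between direct-path attention and indirect paths concrete, the following is a minimal illustrative sketch, not the authors' method. It assumes a plain lower-triangular causal mask and only enumerates two-hop relays across two layers; the helper names `causal_mask` and `two_hop_relays` are hypothetical and do not come from the paper.

```python
def causal_mask(n):
    """Return an n x n mask where mask[i][j] is True if token i may attend to token j."""
    # Under causal masking a token sees itself and everything before it.
    return [[j <= i for j in range(n)] for i in range(n)]


def two_hop_relays(mask, src, dst):
    """List intermediate tokens k that can relay src -> k -> dst across two layers.

    Assumption for illustration: a relay must attend to src in one layer
    and be attended to by dst in the next layer.
    """
    n = len(mask)
    return [
        k
        for k in range(n)
        if k not in (src, dst) and mask[k][src] and mask[dst][k]
    ]


mask = causal_mask(6)

# Direct path: token 5 can attend to token 1 within a single layer.
direct = mask[5][1]

# Indirect paths: tokens lying between 1 and 5 can carry token 1's
# information forward to token 5 over successive layers.
relays = two_hop_relays(mask, src=1, dst=5)

print(direct, relays)
```

In this toy setting the number of two-hop relays grows with the distance between the two tokens, which gives some intuition for why the intermediate tokens can carry a large share of the information reaching a distant token alongside its direct attention edge.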
Taxonomy
Topics: Topic Modeling · Advanced Graph Neural Networks · Natural Language Processing Techniques
