Bird-Eye Transformers for Text Generation Models
Lei Sha, Yuhang Song, Yordan Yordanov, Tommaso Salvatori, Thomas Lukasiewicz

TL;DR
This paper introduces the bird-eye transformer (BET), a novel architecture that enhances text generation by reweighting self-attention to better incorporate historical information, outperforming baseline transformers across multiple datasets.
Contribution
The paper proposes BET, a new transformer variant that improves historical information utilization in text generation tasks, addressing limitations of standard self-attention.
Findings
BET outperforms baseline transformers on all tested datasets.
Reweighting self-attention improves historical context integration.
Experimental results demonstrate superior performance in machine translation and language modeling.
Abstract
Transformers have become an indispensable module for text generation models since their great success in machine translation. Previous works attribute the success of transformers to the query-key-value dot-product attention, which provides a robust inductive bias by the fully connected token graphs. However, we found that self-attention has a severe limitation. When predicting the (i+1)-th token, self-attention only takes the i-th token as an information collector, and it tends to give a high attention weight to those tokens similar to itself. Therefore, most of the historical information that occurred before the i-th token is not taken into consideration. Based on this observation, in this paper, we propose a new architecture, called bird-eye transformer (BET), which goes one step further to improve the performance of transformers by reweighting self-attention to encourage it to focus…
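The limitation described in the abstract can be illustrated with a small sketch of standard causal dot-product attention. This is not the BET method (the abstract is truncated before the reweighting is specified); it only shows how the i-th token's query acts as the sole collector of history when predicting token i+1. The toy embeddings, the function names `softmax` and `causal_attention_row`, and the use of raw embeddings as both queries and keys (no learned projections) are assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_row(queries, keys, i):
    """Attention weights of position i over positions 0..i (causal mask).

    Only the query at position i is used: this single row is what the
    model relies on when predicting token i+1.
    """
    d = len(queries[i])
    scores = [
        sum(q * k for q, k in zip(queries[i], keys[j])) / math.sqrt(d)
        for j in range(i + 1)
    ]
    return softmax(scores)

# Hypothetical 2-d embeddings (assumption, not from the paper).
# Tokens 0 and 3 point the same way; tokens 1 and 2 are orthogonal to them.
embeddings = [
    [3.0, 0.0],
    [0.0, 3.0],
    [0.0, 3.0],
    [3.0, 0.0],
]

i = 3
weights = causal_attention_row(embeddings, embeddings, i)
for j, w in enumerate(weights):
    print(f"token {i} -> token {j}: {w:.4f}")
```

In this toy setting, nearly all of the weight from token 3 lands on token 3 itself and on the similar token 0, while the dissimilar tokens 1 and 2 receive almost none. With learned query and key projections the pattern is less clear-cut, so this should be read as an intuition aid rather than as evidence for the paper's claim.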
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques
