TL;DR
Multiscreen introduces a screening mechanism in language models that explicitly filters irrelevant keys, leading to more interpretable attention, parameter efficiency, and improved stability at longer contexts.
Contribution
The paper presents Multiscreen, a novel architecture with explicit query-key relevance filtering, reducing parameters and enhancing stability compared to standard Transformers.
Findings
Achieves comparable validation loss with roughly 30% fewer parameters than a Transformer baseline.
Remains stable at larger learning rates and longer contexts.
Reduces forward-pass latency at long context lengths.
Abstract
A core limitation of standard softmax attention is that it does not provide an independently interpretable measure of query-key relevance: attention scores are unbounded, while attention weights are defined only relative to competing keys. Consequently, irrelevant keys cannot be explicitly rejected, and some attention mass is assigned even when no key is genuinely relevant. We introduce Multiscreen, a language-model architecture built around a mechanism we call screening, which enables absolute query-key relevance. Instead of redistributing attention across all keys, screening computes bounded query-key similarities and applies an explicit threshold, discarding irrelevant keys and aggregating the remaining keys without global competition. Across experiments, Multiscreen achieves comparable validation loss with roughly 30% fewer parameters than a Transformer baseline and remains…
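The contrast the abstract draws between softmax attention and screening can be illustrated with a small sketch. This is not the paper's implementation: the use of cosine similarity as the bounded similarity, the threshold value of 0.5, the `max(0, sim - threshold)` gate, and the function names `softmax_attention` and `screened_attention` are all assumptions chosen for illustration. The sketch only shows the qualitative behavior described above: softmax always distributes a full unit of attention mass, while a thresholded, unnormalized aggregation can reject every key.

```python
import math


def cosine(u, v):
    """Cosine similarity, bounded in [-1, 1]; returns 0.0 if either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def weighted_sum(weights, values):
    """Combine value vectors using the given per-key weights."""
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def softmax_attention(query, keys, values):
    """Standard softmax attention for one query: weights always sum to 1."""
    scores = [sum(a * b for a, b in zip(query, k)) for k in keys]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return weights, weighted_sum(weights, values)


def screened_attention(query, keys, values, threshold=0.5):
    """Illustrative screening (assumed form): bounded similarity plus a threshold.

    Each key is judged on its own; weights are not normalized across keys,
    so a key below the threshold contributes exactly nothing.
    """
    sims = [cosine(query, k) for k in keys]
    weights = [max(0.0, s - threshold) for s in sims]
    return weights, weighted_sum(weights, values)


query = [1.0, 0.0]
values = [[1.0, 1.0], [3.0, 3.0]]

# Both keys are orthogonal to the query, so neither is relevant.
irrelevant_keys = [[0.0, 1.0], [0.0, -1.0]]
print(softmax_attention(query, irrelevant_keys, values))   # ([0.5, 0.5], [2.0, 2.0])
print(screened_attention(query, irrelevant_keys, values))  # ([0.0, 0.0], [0.0, 0.0])

# Only the first key matches the query.
mixed_keys = [[1.0, 0.0], [0.0, 1.0]]
print(screened_attention(query, mixed_keys, values))       # ([0.5, 0.0], [0.5, 0.5])
```

In the first case softmax splits its attention evenly across two irrelevant keys and returns their average value, whereas the screened version returns a zero vector. In the second case the screened weight of the matching key does not depend on how many other keys are present, which is the sense in which aggregation happens "without global competition".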
