TL;DR
This paper argues that approximations of Transformer attention should account for the value vectors, proposes a value-aware objective for doing so, and shows that the choice of kernel function affects the quality of sparse approximations.
Contribution
The paper presents a value-aware objective for attention approximation and shows, theoretically and empirically, that an optimal approximation of this objective substantially outperforms an optimal approximation that ignores values in language modeling.
Findings
An optimal value-aware approximation substantially outperforms an optimal value-ignoring approximation in language modeling.
The choice of kernel function for computing attention similarity can substantially affect sparse approximation quality, and the skewness of the kernel plays a role.
Theoretical and empirical evidence supports the importance of value vectors in attention approximation.
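To make "skewness" concrete, here is a minimal sketch that measures how much attention mass the top r keys capture under a more skewed and a less skewed weighting. The temperature-scaled exponential kernel is a hypothetical stand-in chosen for illustration, not a kernel taken from the paper, and the names `kernel_weights` and `top_r_mass` are assumptions.

```python
import numpy as np

def kernel_weights(scores, temperature):
    """Normalized exponential kernel; a larger temperature gives a flatter (less skewed) distribution."""
    z = np.exp((scores - scores.max()) / temperature)
    return z / z.sum()

def top_r_mass(weights, r):
    """Total weight held by the r largest entries."""
    return np.sort(weights)[-r:].sum()

rng = np.random.default_rng(0)
scores = rng.normal(size=16)          # toy query-key similarity scores

skewed = kernel_weights(scores, temperature=0.5)
flat = kernel_weights(scores, temperature=4.0)

mass_skewed = top_r_mass(skewed, r=4)
mass_flat = top_r_mass(flat, r=4)
print(f"top-4 mass, skewed kernel: {mass_skewed:.3f}")
print(f"top-4 mass, flat kernel:   {mass_flat:.3f}")
```

With a flatter distribution, less of the attention mass sits in the largest weights, so a sparse approximation that looks only at the weights discards more of the output. That is one plausible reading of why the value vectors would matter more in that regime.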
Abstract
Following the success of dot-product attention in Transformers, numerous approximations have been recently proposed to address its quadratic complexity with respect to the input length. However, all approximations thus far have ignored the contribution of the value vectors to the quality of approximation. In this work, we argue that research efforts should be directed towards approximating the true output of the attention sub-layer, which includes the value vectors. We propose a value-aware objective, and show theoretically and empirically that an optimal approximation of a value-aware objective substantially outperforms an optimal approximation that ignores values, in the context of language modeling. Moreover, we show that the choice of kernel function for computing attention similarity can substantially affect the quality of sparse approximations, where kernel functions…
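The difference between the two objectives can be illustrated with a toy sketch for a single query. It assumes the value-oblivious baseline keeps the r largest attention weights and renormalizes them, and that the value-aware approximation searches all supports of size r and fits the weights by least squares against the true output. Both function names are hypothetical, the brute-force search is only feasible for tiny inputs, and the unconstrained least-squares weights are a simplification that may differ from the paper's formulation.

```python
import itertools
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def value_oblivious(alpha, r):
    """Keep the r largest attention weights and renormalize; values are ignored."""
    keep = np.argsort(alpha)[-r:]
    approx = np.zeros_like(alpha)
    approx[keep] = alpha[keep] / alpha[keep].sum()
    return approx

def value_aware(alpha, V, r):
    """Search every support of size r and fit weights to the true output alpha @ V."""
    target = alpha @ V
    best, best_err = None, np.inf
    for support in itertools.combinations(range(len(alpha)), r):
        idx = list(support)
        # Least-squares weights on this support (unconstrained: an assumption).
        w, *_ = np.linalg.lstsq(V[idx].T, target, rcond=None)
        err = np.linalg.norm(w @ V[idx] - target)
        if err < best_err:
            best_err = err
            best = np.zeros_like(alpha)
            best[idx] = w
    return best

rng = np.random.default_rng(0)
n, d, r = 8, 4, 2                      # keys, value dimension, sparsity budget
alpha = softmax(rng.normal(size=n))    # attention weights for one query
V = rng.normal(size=(n, d))            # value vectors
target = alpha @ V                     # true attention output

err_oblivious = np.linalg.norm(value_oblivious(alpha, r) @ V - target)
err_aware = np.linalg.norm(value_aware(alpha, V, r) @ V - target)
print(f"value-oblivious error: {err_oblivious:.4f}")
print(f"value-aware error:     {err_aware:.4f}")
```

In this setup the value-aware error can never exceed the value-oblivious one, because the top-r support is one of the candidates searched and least squares picks the best weights on it. How large the gap is depends on the data.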
