Copy this Sentence
Vasileios Lioutas, Andriy Drozdyuk

TL;DR
This paper formally defines the attention operation, explores its application to sequence-to-sequence models, and demonstrates that greater use of attention improves performance, convergence speed, and stability on copying tasks.
Contribution
It provides a rigorous mathematical definition of attention and links it to practical implementations, highlighting its benefits in sequence-to-sequence learning.
Findings
Models with more attention perform better on copying tasks.
Attention-based models converge faster.
Attention improves model stability.
Abstract
Attention is an operation that selects some largest element from some set, where the notion of largest is defined elsewhere. Applying this operation to sequence-to-sequence mapping results in significant improvements to the task at hand. In this paper we provide the mathematical definition of attention and examine its application to sequence-to-sequence models. We highlight the exact correspondences between machine learning implementations of attention and our mathematical definition. We provide clear evidence of the effectiveness of attention mechanisms by evaluating models with varying degrees of attention on a very simple task: copying a sentence. We find that models that make greater use of attention perform much better on sequence-to-sequence mapping tasks, converge faster, and are more stable.
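As a rough illustration of the abstract's description of attention as selecting a "largest" element from a set, the sketch below may help. It is not the authors' formulation or model: the function names, the dot-product scoring function, the softmax relaxation, and the one-hot position keys are all assumptions chosen for the example. The `score` argument stands in for the notion of "largest" that the abstract says is defined elsewhere.

```python
import math


def dot(u, v):
    """Assumed scoring function: plain dot product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def hard_attention(query, keys, values, score=dot):
    """Select the single value whose key scores highest against the query."""
    scores = [score(query, k) for k in keys]
    best = max(range(len(keys)), key=lambda i: scores[i])
    return values[best]


def soft_attention(query, keys, values, score=dot):
    """Softmax-weighted average of the values.

    An assumed differentiable relaxation of the hard selection above,
    as commonly used in sequence-to-sequence models.
    """
    scores = [score(query, k) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


# Toy copy task: each position gets a one-hot key (hypothetical setup).
sentence = ["copy", "this", "sentence"]
n = len(sentence)
keys = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

# Querying with the key of position t selects the token at position t.
copied = [hard_attention(keys[t], keys, sentence) for t in range(n)]
```

In this toy setup, querying with each position's own key reproduces the input sentence token by token, which is the copying behavior the paper uses as its test task. A trained model would have to learn keys and queries rather than being handed them.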
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Artificial Intelligence in Healthcare and Education
