On the Learning Dynamics of Attention Networks
Rahul Vashisht, Harish G. Ramaswamy

TL;DR
This paper investigates the learning dynamics of attention networks under different loss functions, revealing distinct behaviors and proposing a hybrid approach that leverages their advantages for improved performance.
Contribution
It provides a theoretical analysis of attention learning dynamics, derives closed-form parameter trajectories, and introduces a hybrid loss method that enhances attention model training.
Findings
Soft attention improves quickly initially but stagnates later.
Hard attention shows the opposite pattern: it improves slowly at first and makes its gains later.
A hybrid approach combining both losses outperforms individual paradigms.
Abstract
Attention models are typically learned by optimizing one of three standard loss functions, variously called soft attention, hard attention, and latent variable marginal likelihood (LVML) attention. All three paradigms are motivated by the same goal of finding two models: a "focus" model that "selects" the right segment of the input and a "classification" model that processes the selected segment into the target label. However, they differ significantly in the way the selected segments are aggregated, resulting in distinct dynamics and final results. We observe a unique signature of models learned using these paradigms and explain this as a consequence of the evolution of the classification model under gradient descent when the focus model is fixed. We also analyze these paradigms in a simple setting and derive closed-form expressions for the parameter trajectory…
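The difference in how the three paradigms aggregate segments can be made concrete with a small sketch. It assumes the commonly used formulations (soft attention classifies the attention-weighted average of the segments, hard attention takes the expected per-segment loss, and LVML takes the log of the expected per-segment likelihood); the paper's exact definitions may differ. The linear classifier, the function names, and the toy numbers are all hypothetical and chosen only for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(segment, weights):
    """Hypothetical linear classifier: class probabilities for one segment."""
    logits = [sum(w * v for w, v in zip(row, segment)) for row in weights]
    return softmax(logits)

def attention_losses(segments, focus_scores, weights, label):
    """Return the soft, hard, and LVML losses for one example (assumed forms)."""
    # Focus model output: one attention weight per segment.
    alpha = softmax(focus_scores)

    # Soft attention: aggregate the segments first, then classify the mixture.
    dim = len(segments[0])
    mixed = [sum(a * seg[d] for a, seg in zip(alpha, segments)) for d in range(dim)]
    soft = -math.log(classify(mixed, weights)[label])

    # Probability of the correct label when each segment is classified alone.
    p = [classify(seg, weights)[label] for seg in segments]

    # Hard attention: expected negative log-likelihood over the selected segment.
    hard = sum(a * -math.log(pj) for a, pj in zip(alpha, p))

    # LVML: negative log of the expected likelihood (segment index marginalized).
    lvml = -math.log(sum(a * pj for a, pj in zip(alpha, p)))

    return {"soft": soft, "hard": hard, "lvml": lvml}

# Toy example: three 2-dimensional segments, two classes.
segments = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
focus_scores = [2.0, 0.0, 0.0]
weights = [[1.0, 0.0], [-1.0, 0.0]]
losses = attention_losses(segments, focus_scores, weights, label=0)
print(losses)
```

Under these assumed forms, Jensen's inequality guarantees that the hard attention loss is never smaller than the LVML loss for the same focus and classification models, and all three losses coincide when there is only one segment.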
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Neural Networks and Applications · Neural dynamics and brain function
Methods: Focus
