Optimizing Attention with Mirror Descent: Generalized Max-Margin Token Selection
Addison Kristanto Julistiono, Davoud Ataee Tarzanagh, Navid Azizan

TL;DR
This paper analyzes mirror descent algorithms for attention models, showing they converge to max-margin solutions, improve generalization over gradient descent, and effectively select relevant tokens in complex neural architectures.
Contribution
It introduces a theoretical framework for mirror descent in attention mechanisms, revealing convergence to max-margin solutions and demonstrating improved practical performance.
Findings
Mirror descent converges to generalized max-margin SVM solutions.
MD algorithms outperform gradient descent in generalization.
Numerical experiments show superior token selection with MD.
Abstract
Attention mechanisms have revolutionized several domains of artificial intelligence, such as natural language processing and computer vision, by enabling models to selectively focus on relevant parts of the input data. While recent work has characterized the optimization dynamics of gradient descent (GD) in attention-based models and the structural properties of its preferred solutions, less is known about more general optimization algorithms such as mirror descent (MD). In this paper, we investigate the convergence properties and implicit biases of a family of MD algorithms tailored for softmax attention mechanisms, with the potential function chosen as the $p$-th power of the $\ell_p$-norm. Specifically, we show that these algorithms converge in direction to a generalized hard-margin SVM with an $\ell_p$-norm objective when applied to a classification problem using a softmax attention…
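The abstract names the potential as the $p$-th power of the $\ell_p$-norm. The following is a minimal sketch of a single mirror descent step under that kind of potential, assuming $\psi(w) = \frac{1}{p}\|w\|_p^p$ with $p > 1$. The $1/p$ scaling, the function names, and the toy vectors are illustrative assumptions, not taken from the paper, and the step is shown on a generic parameter vector rather than on the paper's attention model.

```python
import numpy as np

def mirror_map(w, p):
    """Gradient of the assumed potential psi(w) = (1/p) * ||w||_p^p (elementwise)."""
    return np.sign(w) * np.abs(w) ** (p - 1)

def inverse_mirror_map(z, p):
    """Inverse of mirror_map; requires p > 1."""
    return np.sign(z) * np.abs(z) ** (1.0 / (p - 1))

def lp_mirror_descent_step(w, grad, lr, p):
    """One MD step: take the gradient step in the dual space, then map back."""
    dual = mirror_map(w, p) - lr * grad
    return inverse_mirror_map(dual, p)

# Hypothetical parameter vector and loss gradient, for illustration only.
w = np.array([0.5, -1.5, 2.0])
grad = np.array([1.0, -2.0, 0.5])

w_p2 = lp_mirror_descent_step(w, grad, lr=0.1, p=2.0)  # identical to plain GD
w_p3 = lp_mirror_descent_step(w, grad, lr=0.1, p=3.0)  # a different geometry
```

With $p = 2$ the mirror map is the identity, so the step reduces to the ordinary gradient descent update `w - lr * grad`; other values of $p$ change the geometry of the update.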
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. Theoretical Contributions: The paper provides a solid theoretical foundation for understanding the convergence properties and implicit bias of MD in attention models. The extension to $\ell_p$-norm objectives adds flexibility in modeling and opens up new avenues for optimizing attention mechanisms. 2. Generalization of Attention Optimization: The approach generalizes previous work on attention models by using MD with a broad class of potential functions, allowing a deeper exploration of the
1. As far as I know, mirror descent is not a popular optimization algorithm for training deep learning models. I agree that a simplified model (e.g., the one-layer model considered in this paper and previous work) could provide valuable insights for understanding transformers, but it is not clear what the role or implication of the $\ell_p$-norm is for deep learning. If possible, it would be helpful if the authors could highlight a few practical mirror descent-based optimizers in the revision. 2. For Lin
- The paper is well written.
- It is quite easy to understand.
- The fact that MD achieves sparser weights and leads to better generalization is interesting.
- The motivation is quite unclear. “A broader understanding of general descent algorithms, including the mirror descent (MD) family and their token selection properties, is essential.” Why? In practice, nobody trains attention layers with mirror descent. The observation that it works better than gradient descent is not very strong in my opinion, because in practice transformers cannot be trained with gradient descent either. In the experiments, it would be worthwhile to compare the proposed meth
The paper attempts to provide a theoretical analysis of mirror descent for attention training, extending prior work focused on gradient descent. It derives convergence results to a generalized hard-margin SVM and establishes convergence rates. The use of $\ell_p$ norms offers a degree of generality in the theoretical analysis.
1. The core idea of connecting attention optimization to SVM-like objectives is not new and has been explored in prior work, notably in "A Primal-Dual Framework for Transformers and Neural Networks" by Nguyen et al. and related papers. These prior works establish the fundamental link between attention and SVMs, including the optimization perspective. While this paper extends the analysis to mirror descent, the incremental contribution feels minimal and lacks motivation. Other core ideas of analy
1. The paper has a strong theoretical foundation: it provides rigorous mathematical analysis and proofs for the convergence properties of mirror descent in attention optimization, extending previous work on gradient descent to a more general framework. 2. This paper provides a novel algorithmic insight: the introduction of $\ell_p$-AttGD generalizes both $\ell_p$-GD and attention GD, offering new perspectives on attention optimization and token selection. 3. This work provides a complete theoret
1. The empirical evaluation is limited: the paper includes experiments, but they are primarily focused on synthetic data and a single real-world dataset (Stanford Large Movie Review Dataset). More diverse real-world applications would strengthen the practical implications. 2. The theoretical results are highly dependent on assumptions: Theorems 2, 3, and 4 rely on specific assumptions about initialization and step sizes, which may limit their practical applicability. 3. The paper does
Taxonomy
Topics: Visual Attention and Saliency Detection
Methods: Attention Is All You Need · Focus · Softmax · Support Vector Machine
