Torque-Aware Momentum
Pranshu Malviya, Goncalo Mordido, Aristide Baratin, Reza Babanezhad Harikandeh, Gintare Karolina Dziugaite, Razvan Pascanu, Sarath Chandar

TL;DR
Torque-Aware Momentum (TAM) improves the stability and effectiveness of momentum-based optimizers in deep learning by adaptively damping updates based on gradient alignment, leading to better exploration and generalization.
Contribution
The paper introduces TAM, a novel momentum method that incorporates gradient-angle-based damping to enhance optimizer stability and performance.
Findings
TAM outperforms classical momentum in various tasks.
TAM improves exploration and handles distribution shifts better.
TAM enhances generalization in image and language tasks.
Abstract
Efficiently exploring complex loss landscapes is key to the performance of deep neural networks. While momentum-based optimizers are widely used in state-of-the-art setups, classical momentum can still struggle with large, misaligned gradients, leading to oscillations. To address this, we propose Torque-Aware Momentum (TAM), which introduces a damping factor based on the angle between the new gradients and previous momentum, stabilizing the update direction during training. Empirical results show that TAM, which can be combined with both SGD and Adam, enhances exploration, handles distribution shifts more effectively, and improves generalization performance across various tasks, including image classification and large language model fine-tuning, when compared to classical momentum-based optimizers.
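The damping idea in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the abstract only says the damping factor depends on the angle between the new gradient and the previous momentum, so the mapping `(1 + cos) / 2` and the names `cosine_similarity` and `damped_momentum_step` are assumptions made here for illustration.

```python
import math


def cosine_similarity(u, v, eps=1e-12):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # eps avoids division by zero when either vector is all zeros
    return dot / (norm_u * norm_v + eps)


def damped_momentum_step(params, grad, momentum, lr=0.1, beta=0.9):
    """One SGD-with-momentum step whose gradient term is scaled by alignment.

    ASSUMPTION: damping = (1 + cos) / 2, so a gradient aligned with the
    previous momentum gets weight close to 1 and an opposing gradient gets
    weight close to 0. The paper's exact rule may differ.
    """
    cos = cosine_similarity(grad, momentum)
    damping = 0.5 * (1.0 + cos)
    new_momentum = [beta * m + damping * g for m, g in zip(momentum, grad)]
    new_params = [p - lr * m for p, m in zip(params, new_momentum)]
    return new_params, new_momentum, damping


# Aligned gradient: contributes almost fully to the momentum buffer.
_, m_aligned, d_aligned = damped_momentum_step([0.0, 0.0], [1.0, 0.0], [2.0, 0.0])

# Opposing gradient: almost fully damped, so the update direction is preserved.
_, m_opposed, d_opposed = damped_momentum_step([0.0, 0.0], [-1.0, 0.0], [2.0, 0.0])
```

Under this assumed mapping, a gradient that points against the accumulated momentum barely changes the buffer, which is one way to read the abstract's claim that the damping stabilizes the update direction.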
Peer Reviews
Decision: Submitted to ICLR 2025
1. The paper proposes an innovative concept by connecting damping mechanics to momentum updates, which adds an insightful dimension to the ML optimization community. 2. The paper is equipped with detailed empirical analysis, which includes comparisons over a variety of benchmarks and model types, highlighting the robustness and wide applicability of the proposed TAM. 3. The concept of incorporating a damping factor contingent on gradient alignment is well-explained, and the pseudo-code presen
1. Despite the fact that TAM is computationally efficient, the requirement to compute the cosine similarity between gradients and momentum might introduce non-trivial implementation complexity in certain training frameworks. 2. While comparisons with the standard optimizers and a few existing approaches are robust, additional benchmarks against more recent optimizers concentrating on gradient stability would further strengthen the claims. 3. The paper illustrates TAM's effectiveness on variou
1. A damping effect was applied based on the cosine similarity, which intuitively illustrates the relationship between gradients and momentum. 2. The experiments conducted on the Language Model (LM) side involved a diverse range of datasets. 3. The small change in gradient norm on the CIFAR10 dataset suggests that TAM has some effect in mitigating oscillations.
1. Throughout the paper, TAM and AdaTAM(W) are evaluated using a limited set of models, raising questions about whether the TAM approach works only with specific models. For instance, in the image domain, only ResNet-based models were used, while in the language model (LM) domain, only BERT-based models were employed. It is recommended to use other models in the image domain, such as MobileNet and ViT, and in the language domain, models like GPT, T5, and LLaMA. 2. In Table 1, the performance of
- As momentum is a very important technique for optimization and deep learning, it is always interesting to see a novel and effective momentum method, such as TAM. - The experiments cover both vision tasks and language tasks. - The idea of TAM is elegant and reasonable.
- This paper lacks convergence analysis. As an optimization method, a convergence guarantee is expected, and many previous studies on momentum provided convergence analysis. - While large oscillations are bad, oscillations are sometimes good for searching minima intuitively. This paper also fails to formally explain the generalization advantage of TAM over standard momentum. Generalization bound analysis can be helpful. - The empirical improvements are marginal, especially for vision tasks. The
Taxonomy
Topics: Robot Manipulation and Learning
Methods: Temporal Adaptive Module · Stochastic Gradient Descent · Adam
