Asymmetric Momentum: A Rethinking of Gradient Descent
Gongyue Zhang, Dinghuang Zhang, Shuwen Zhao, Donghan Liu, Carrie M. Toptan, Honghai Liu

TL;DR
This paper introduces Loss-Controlled Asymmetric Momentum (LCAM), a novel SGD enhancement that adaptively accelerates parameters based on loss phases, improving training efficiency and accuracy across datasets without extra computational costs.
Contribution
The paper proposes LCAM, a simple yet effective adaptive momentum method that adjusts to all dataset types by leveraging loss phases, challenging traditional adaptive optimizer assumptions.
Findings
LCAM accelerates slow-changing parameters in sparse gradients.
LCAM accelerates frequently-changing parameters in non-sparse gradients.
LCAM achieves comparable or better accuracy with fewer epochs.
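To make "accelerates" concrete: in the common heavy-ball form of SGD with momentum (v ← μ·v + g), a gradient that stays at the same value g builds up a velocity that approaches g / (1 − μ), so a larger momentum coefficient μ means a larger effective step. The sketch below checks this numerically. It illustrates standard momentum arithmetic only and is not code from the paper; the function name `steady_state_velocity` and the exact update form are assumptions made here.

```python
def steady_state_velocity(grad: float, momentum: float, steps: int = 2000) -> float:
    """Iterate v <- momentum * v + grad from v = 0 and return the final velocity.

    Assumed update form: heavy-ball momentum without dampening.
    For a constant gradient the velocity approaches grad / (1 - momentum).
    """
    v = 0.0
    for _ in range(steps):
        v = momentum * v + grad
    return v


if __name__ == "__main__":
    # The same unit gradient yields a larger step under larger momentum.
    print(steady_state_velocity(1.0, 0.90))  # approximately 10.0
    print(steady_state_velocity(1.0, 0.95))  # approximately 20.0
```

Switching μ between two values therefore changes how strongly a consistently signed gradient is amplified, which is the lever a phase-dependent momentum schedule has to work with.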
Abstract
Through theoretical and experimental validation, and in contrast to all existing adaptive methods such as Adam, which penalize frequently-changing parameters and are only applicable to sparse gradients, we propose the simplest SGD-enhanced method, Loss-Controlled Asymmetric Momentum (LCAM). By averaging the loss, we divide the training process into different loss phases and use a different momentum in each. It can not only accelerate slow-changing parameters for sparse gradients, similar to adaptive optimizers, but can also choose to accelerate frequently-changing parameters for non-sparse gradients, thus being adaptable to all types of datasets. We reinterpret the machine learning training process through the concepts of weight coupling and weight traction, and experimentally validate that weights have directional specificity, which is correlated with the specificity of the dataset. Thus interestingly, we…
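The phase-switching idea in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how the loss is averaged or which phase receives the larger momentum, so the sketch assumes a running mean over all losses seen so far and leaves both momentum values as free parameters. The names `LossPhaseMomentum`, `momentum_sgd_step`, and `train_quadratic`, and the toy quadratic objective, are hypothetical.

```python
class LossPhaseMomentum:
    """Pick a momentum coefficient from the phase of the current loss.

    Assumption: a step is in the "high-loss" phase when its loss exceeds
    the running mean of all losses seen so far (current one included).
    """

    def __init__(self, momentum_high_loss: float, momentum_low_loss: float):
        self.momentum_high_loss = momentum_high_loss
        self.momentum_low_loss = momentum_low_loss
        self._loss_sum = 0.0
        self._count = 0

    def select(self, loss: float) -> float:
        # Update the running mean, then compare the current loss against it.
        self._loss_sum += loss
        self._count += 1
        mean_loss = self._loss_sum / self._count
        if loss > mean_loss:
            return self.momentum_high_loss
        return self.momentum_low_loss


def momentum_sgd_step(w, v, grad, lr, momentum):
    """One heavy-ball step: v <- momentum * v + grad, then w <- w - lr * v."""
    v = [momentum * vi + gi for vi, gi in zip(v, grad)]
    w = [wi - lr * vi for wi, vi in zip(w, v)]
    return w, v


def train_quadratic(steps: int = 100):
    """Toy run on loss(w) = 0.5 * sum(a_i * w_i**2); not the paper's setup."""
    a = [1.0, 0.1]  # one steep and one shallow direction
    w = [1.0, 1.0]
    v = [0.0, 0.0]
    chooser = LossPhaseMomentum(momentum_high_loss=0.95, momentum_low_loss=0.9)
    losses = []
    for _ in range(steps):
        loss = 0.5 * sum(ai * wi * wi for ai, wi in zip(a, w))
        losses.append(loss)
        mu = chooser.select(loss)  # momentum depends on the loss phase
        grad = [ai * wi for ai, wi in zip(a, w)]
        w, v = momentum_sgd_step(w, v, grad, lr=0.1, momentum=mu)
    return w, losses


if __name__ == "__main__":
    final_w, losses = train_quadratic()
    print(losses[0], losses[-1])
```

Swapping the two constructor arguments reverses which phase is accelerated, which is one way to read the abstract's statement that the method "can choose" which parameters to speed up. The only per-step overhead in this sketch is a scalar running mean and one comparison.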
Peer Reviews
Decision · Submitted to ICLR 2024
The authors make an effort to explain the proposed method in an intuitive way.

1. Despite the attempt to give an intuitive explanation, many of the concepts are not well defined or explained, e.g., weight coupling, oscillatory state, coupling state. Overall, section 3 is difficult to follow, and the motivation is not convincing.
2. The experiments are only conducted on CIFAR10/100 with wide resnet, and do not show significant improvement. Moreover, the accuracy values do not have confidence intervals.
3. It is unclear how the multiple hyperparameters are determined, and no…

- Some interesting experimental observations are reported. Specifically, Figure 3 and Figure 4 show that accelerating different parameter groups (sparse or non-sparse depending on the nature of the dataset) seems to lead to better test error.
- The determination of the sparse or non-sparse phase based on the loss seems intuitive, given that the non-sparse weights change more frequently and contribute more to the overall loss change.

- The justifications and the framework are purely heuristic. There are no quantitative arguments or actual theory to concretely explain the observed phenomenon. The linear model (e.g. eqn 1) is overly simplified and may not be able to capture the training dynamics of a non-linear neural network.
- The proposed algorithm is restricted to models that are able to extract features (such as wide residual networks), which limits its applicability in other scenarios.
- The current related…

1. The introduction of LCAM provides a fresh perspective on optimizing the gradient descent process, especially in the context of non-sparse gradients.
2. The paper provides a solid theoretical foundation, introducing concepts like weight coupling and weight traction.
3. The experiments on Cifar10 and Cifar100 using WRN provide empirical evidence supporting the proposed method's efficacy.
4. The authors emphasize the reproducibility of their experiments, which is crucial for the scientific commu…

1. The paper delves deep into theoretical aspects, which might make it challenging for readers unfamiliar with the topic.
2. The experiments are primarily conducted on Cifar10 and Cifar100. Testing on a broader range of datasets would provide a more comprehensive understanding of LCAM's applicability.
3. The mechanism for reducing the learning rate at every iteration is based on empirical observations. A more systematic approach or justification would strengthen the paper's claims.
4. The influe…
Taxonomy
Topics: Advanced Neural Network Applications · Machine Learning and ELM · Neural Networks and Applications
Methods: Adam · Stochastic Gradient Descent
