Probability-Dependent Gradient Decay in Large Margin Softmax

Siyuan Zhang; Linbo Xie; Ying Chen

arXiv:2210.17145·stat.ML·October 10, 2023·1 cites

Open Access · 3 Reviews

TL;DR

This paper introduces a probability-dependent gradient decay hyperparameter in Softmax, analyzing its impact on training dynamics, generalization, and curriculum learning, supported by theoretical insights and empirical results across multiple datasets.

Contribution

It proposes a novel gradient decay hyperparameter in Softmax, linking large margin Softmax, local Lipschitz constraints, and curriculum learning, with a dynamic warm-up strategy for training.

Findings

01

Gradient decay rate significantly influences generalization performance.

02

Small gradient decay yields a curriculum-like learning sequence: hard samples are emphasized only after easy samples have been fit with sufficient confidence.

03

Dynamic adjustment of gradient decay accelerates convergence during training.

Abstract

In the past few years, Softmax has become a common component in neural network frameworks. In this paper, a gradient decay hyperparameter is introduced in Softmax to control the probability-dependent gradient decay rate during training. Through theoretical analysis and empirical results for a variety of model architectures trained on MNIST, CIFAR-10/100 and SVHN, we find that generalization performance depends significantly on how fast the gradient decays as the confidence probability rises, i.e., on whether the gradient decreases convexly or concavely as the sample probability increases. Moreover, optimization with a small gradient decay follows a sequence similar to curriculum learning, in which hard samples come into the spotlight only after easy samples have been fit with sufficient confidence, and well-separated samples receive a higher gradient to reduce intra-class distance. Based on the analysis results, we…
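The convex-versus-concave decay described in the abstract can be illustrated with a small numerical sketch. The abstract excerpt does not state the loss, so the form used here is an assumption: $J = -\log\big(e^{z_c} / (\sum_{j \neq c} e^{z_j} + \beta e^{z_c})\big)$, which reduces to the standard Softmax cross-entropy at $\beta = 1$. Under that assumption the gradient magnitude with respect to the true-class logit is $(1 - p)/(1 - p + \beta p)$, where $p$ is the ordinary Softmax probability of the true class. The function names `beta_softmax_loss` and `grad_magnitude` are hypothetical and chosen only for this example.

```python
import numpy as np

def beta_softmax_loss(z, c, beta):
    """Assumed loss: -log(e^{z_c} / (sum_{j != c} e^{z_j} + beta * e^{z_c})).

    This form is an assumption made for illustration; beta = 1 gives the
    standard Softmax cross-entropy.
    """
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())          # shifting logits leaves the ratio unchanged
    others = e.sum() - e[c]          # sum over the non-target classes
    return -np.log(e[c] / (others + beta * e[c]))

def grad_magnitude(p, beta):
    """|dJ/dz_c| under the assumed loss, as a function of the Softmax probability p."""
    p = np.asarray(p, dtype=float)
    return (1.0 - p) / (1.0 - p + beta * p)

if __name__ == "__main__":
    p = np.linspace(0.0, 1.0, 6)
    for beta in (0.1, 1.0, 10.0):
        # beta < 1: concave (slow) decay; beta = 1: linear; beta > 1: convex (fast) decay
        print(f"beta={beta:>4}: {np.round(grad_magnitude(p, beta), 3)}")
```

In this sketch, a small $\beta$ keeps the gradient near its maximum until $p$ is close to 1, while a large $\beta$ shrinks it quickly once a sample is moderately confident. Whether this matches the paper's exact parameterization should be checked against the full text.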

Peer Reviews

Decision·Submitted to ICLR 2024

Reviewer 01 · Rating 6 · marginally above the acceptance threshold · Confidence 3

Strengths

The paper proposes a simple modification to Softmax, combined with a warm-up scheme for the margin parameter $\beta$, to achieve faster convergence and better generalization.

Weaknesses

According to Table 2 in the paper, the warm-up scheme either provides no significant advantage over previously proposed modifications to Softmax (e.g., A-Softmax) or performs worse.

Reviewer 02 · Rating 5 · marginally below the acceptance threshold · Confidence 3

Strengths

The paper provides a very detailed guide to understanding the influence of the margin parameter $\beta$ on the decay rate of the gradient. The study looks comprehensive and correct, and it leads to a successful empirical verification. Most of the paper is well organized, although some parts need additional care.

Weaknesses

Section 2 needs revision; see the questions section for details. As far as I can tell, the improvement in classification error is marginal. The baseline accuracy should correspond to $\beta = 1$, and the highlighted best errors may not be a significant improvement over it. Although the evaluation might be objective, this concern can be partially addressed by reporting the standard deviation over multiple runs, so that statistical significance can be verified. Given the concerns abov

Reviewer 03 · Rating 5 · marginally below the acceptance threshold · Confidence 3

Strengths

I find this work well motivated: the Softmax function is ubiquitous in modern machine learning, and studying its various caveats is important. The connection with calibration is interesting, and the results in Figure 6 are very promising. In particular, calibration improves as $\beta$ increases, which makes the model less influenced by samples whose $p_y$ is already close to $1$.

Weaknesses

I found two main weaknesses in this work. The first is an overall lack of clarity: I find the paper hard to read. Here are some parts I found confusing:
- "MSE takes into account more complex optimization scenarios": what do you mean by that?
- "Hard mining strategy": you could briefly introduce what this is.
- In Section 2, you talk about $J_j$ before introducing it.
- In Figure 3, there are no legends for the top row, and the caption does not help to clarify the different curves.

Taxonomy

Topics: Advanced Neural Network Applications · Adversarial Robustness in Machine Learning · Brain Tumor Detection and Classification

Methods: Softmax · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings