Accelerated Neural Network Training with Rooted Logistic Objectives
Zhu Wang, Praveen Raj Veluswami, Harsh Mishra, Sathya N. Ravi

TL;DR
This paper introduces a novel rooted logistic loss function that enhances neural network training by ensuring faster convergence and improved performance across various models and applications.
Contribution
The paper derives a new strictly convex loss function based on logistic landscape design, extending its application to deep models and generative tasks.
Findings
Faster convergence in training deep neural networks.
Performance improvements on classification benchmarks.
Effective application in generative model fine-tuning.
Abstract
Many neural networks deployed in real-world scenarios are trained using cross-entropy-based loss functions. From the optimization perspective, it is known that the behavior of first-order methods such as gradient descent crucially depends on the separability of datasets. In fact, even in the simplest case of binary classification, the rate of convergence depends on two factors: (1) the condition number of the data matrix, and (2) the separability of the dataset. Without further pre-processing techniques such as over-parametrization or data augmentation, separability is an intrinsic quantity of the data distribution under consideration. We focus on the landscape design of the logistic function and derive a novel sequence of strictly convex functions that are at least as strict as logistic loss. The minimizers of these functions coincide with those of the minimum norm solution…
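To make the idea of a "sequence" of losses concrete, here is a minimal NumPy sketch. It is an illustration built on an assumption, not the paper's definition: it uses the standard limit log(u) = lim k·(u^(1/k) − 1) as k → ∞ to build a rooted surrogate for the binary logistic loss log(1 + exp(−m)), where m is the margin y·wᵀx. The function names `logistic_loss` and `rooted_logistic_loss` are hypothetical, and the exact scaling and offset used in the paper may differ.

```python
import numpy as np


def logistic_loss(margin):
    """Standard logistic loss log(1 + exp(-margin)), computed stably."""
    return np.logaddexp(0.0, -margin)


def rooted_logistic_loss(margin, k):
    """Assumed rooted surrogate: k * ((1 + exp(-margin))**(1/k) - 1).

    Written as k * expm1(log(1 + exp(-margin)) / k) for numerical stability.
    This form is an assumption for illustration, not taken from the paper.
    """
    return k * np.expm1(np.logaddexp(0.0, -margin) / k)


# Compare the surrogate with the logistic loss on a small grid of margins.
margins = np.linspace(-3.0, 3.0, 7)
gaps = {}
for k in (2, 8, 64):
    diff = rooted_logistic_loss(margins, k) - logistic_loss(margins)
    gaps[k] = float(np.max(np.abs(diff)))
    print(f"k={k:3d}  max gap to logistic loss = {gaps[k]:.4f}")
```

Under this assumed form, the surrogate upper-bounds the logistic loss for every k, reduces to the exponential loss exp(−m) at k = 1, and approaches the logistic loss as k grows, so k acts as the extra tuning parameter the reviews mention.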
Peer Reviews
Decision: ICLR 2024 Conference Withdrawn Submission
The RLO objective seems novel to me. Advertised to be a better alternative to logistic loss, the proposed RLO can potentially be widely applied to various supervised tasks. In this work, the authors not only evaluated standard image classification but also extended to training the discriminator in GAN models. I like the inclusion of a toy data case to provide more visual clues.
## Weak theoretical analysis First, the scope of this work is on substituting the logistic loss, or more precisely, approximating the log function with polynomials. However, in classification, the logistic loss is only one of the many surrogate losses for the more fundamental 0-1 loss. The authors did not discuss any other surrogate loss functions and how they relate to the 0-1 loss. [1] is a related work. Second, the theoretical statements are hand-wavy. For instance, the "better conditionin
The presentation of this paper is clear.
1. RLO lacks a mathematical derivation for replacing the log with the power $1/k$ in the cross-entropy loss. In fact, cross-entropy minimizes $-\log P$, where $P$ is the likelihood. As the data are i.i.d., the likelihood of all the data can be written as the product of the likelihoods of each sample, i.e., $$\log P(y_1,y_2,\dots,y_n \mid x_1,x_2,\dots,x_n)= \log \prod_{i=1}^n P(y_i \mid x_i) = \sum_{i=1}^n \log P(y_i \mid x_i).$$ In this paper, it replaces log with $(\cdot)^{1/k}$ and still sums them togethe
1. The paper is overall well-organized. 2. The proposed method is interesting and novel to the best of the reviewer's knowledge.
1. The experiments conducted in the paper are based on datasets that are too simple, and the corresponding baseline test accuracy is not reasonable (<90% test accuracy for CIFAR-10). The reviewer would appreciate it if results on more realistic datasets could be included. 2. The reviewer didn't check the full derivation in the Appendix, but the derivation shown in the paper seems a bit sloppy (see Questions), which somewhat undermines the soundness of the paper.
1. This paper proposes a new rooted loss objective. 2. This paper is sound. 3. This paper provides a lot of experiments on different datasets of supervised and unsupervised learning.
1. The contribution of this paper is limited. The new objective loss function is based on the approximation of the natural logarithm function. If using the proposed loss objective, we introduce a new tuning parameter $k$. It may take a lot of time to tune this new parameter, but the improvement of the performance is not significant enough, as shown in Table 3. 2. Some parts of the paper are not explained clearly. For example, this paper mentions that the reason of generalization bounds for logi
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Applications · Advanced Neural Network Applications
Methods: Dense Connections · Adaptive Instance Normalization · R1 Regularization · Feedforward Network · Convolution · Focus · StyleGAN
