Evaluation of Neural Architectures Trained with Square Loss vs Cross-Entropy in Classification Tasks
Like Hui, Mikhail Belkin

TL;DR
This study challenges the common belief that cross-entropy loss outperforms square loss in neural network classification, showing that square loss often yields comparable or better results across various tasks and architectures.
Contribution
The paper provides empirical evidence that square loss can be as effective as, or better than, cross-entropy for training neural classifiers across multiple domains.
Findings
Square loss performs comparably to or better than cross-entropy in the majority of NLP and ASR experiments.
Cross-entropy has a slight advantage in computer vision tasks.
Square loss training is less sensitive to initialization randomness.
Abstract
Modern neural architectures for classification tasks are trained using the cross-entropy loss, which is widely believed to be empirically superior to the square loss. In this work we provide evidence indicating that this belief may not be well-founded. We explore several major neural architectures and a range of standard benchmark datasets for NLP, automatic speech recognition (ASR) and computer vision tasks to show that these architectures, with the same hyper-parameter settings as reported in the literature, perform comparably or better when trained with the square loss, even after equalizing computational resources. Indeed, we observe that the square loss produces better results in the dominant majority of NLP and ASR experiments. Cross-entropy appears to have a slight edge on computer vision tasks. We argue that there is little compelling empirical or theoretical evidence…
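Example
To make the comparison concrete, the two losses can be sketched for a single example as below. This is an illustrative sketch, not the authors' implementation: it assumes the square loss is taken between the raw network outputs and a one-hot encoding of the label, averaged over classes, and that cross-entropy is computed from the same outputs via a softmax. The function names (one_hot, cross_entropy_loss, square_loss, predict) are hypothetical and do not come from the paper.

```python
import math


def one_hot(label, num_classes):
    """Return a one-hot target vector for an integer class label."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]


def cross_entropy_loss(outputs, label):
    """Softmax cross-entropy for one example, computed from raw outputs.

    Uses the log-sum-exp form with the maximum subtracted for numerical
    stability: loss = log(sum_k exp(o_k)) - o_label.
    """
    m = max(outputs)
    log_z = m + math.log(sum(math.exp(o - m) for o in outputs))
    return log_z - outputs[label]


def square_loss(outputs, label):
    """Square loss between raw outputs and the one-hot target.

    Assumption: the loss is averaged over classes and no softmax is applied.
    """
    target = one_hot(label, len(outputs))
    return sum((o - t) ** 2 for o, t in zip(outputs, target)) / len(outputs)


def predict(outputs):
    """Under either loss, the predicted class is the index of the largest output."""
    return max(range(len(outputs)), key=lambda k: outputs[k])


if __name__ == "__main__":
    outputs = [2.0, 0.5, -1.0]  # hypothetical network outputs for 3 classes
    label = 0
    print("cross-entropy:", cross_entropy_loss(outputs, label))
    print("square loss:  ", square_loss(outputs, label))
    print("prediction:   ", predict(outputs))
```

In this sketch both losses are minimized by pushing the output for the correct class above the others, and prediction is the same argmax in either case; only the training objective differs.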
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Machine Learning and Data Classification
