Conjugate Learning Theory: Uncovering the Mechanisms of Trainability and Generalization in Deep Neural Networks
Binchuan Qi

TL;DR
This paper introduces a conjugate learning framework for deep neural networks that characterizes trainability and generalization, linking theoretical insights with empirical validation to enhance understanding of deep learning mechanisms.
Contribution
It develops a novel conjugate duality-based theory for DNN learnability, providing bounds on training and generalization errors, and analyzing the effects of architecture and data.
Findings
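The paper argues that mini-batch SGD reaches global optima of the empirical risk by jointly controlling the extreme eigenvalues of a structure matrix and the gradient energy, and states a corresponding convergence theorem.
Batch size and model architecture (depth, parameter count, sparsity, skip connections) shape the non-convex optimization.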
Theoretical bounds on generalization error depend on information loss and feature-label entropy.
Abstract
In this work, we propose a notion of practical learnability grounded in finite sample settings, and develop a conjugate learning theoretical framework based on convex conjugate duality to characterize this learnability property. Building on this foundation, we demonstrate that training deep neural networks (DNNs) with mini-batch stochastic gradient descent (SGD) achieves global optima of empirical risk by jointly controlling the extreme eigenvalues of a structure matrix and the gradient energy, and we establish a corresponding convergence theorem. We further elucidate the impact of batch size and model architecture (including depth, parameter count, sparsity, skip connections, and other characteristics) on non-convex optimization. Additionally, we derive a model-agnostic lower bound for the achievable empirical risk, theoretically demonstrating that data determines the fundamental limit…
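For orientation, the abstract refers to convex conjugate duality without stating a definition. The standard textbook definition of the convex conjugate (Fenchel conjugate) and the Fenchel-Young inequality are sketched below. The symbols f, x, and y are generic placeholders and are not taken from the paper, whose own notation and loss construction may differ.

```latex
% Standard convex conjugate of a function f : R^n -> R \cup \{+\infty\}
% (generic notation, assumed here for illustration only)
f^{*}(y) \;=\; \sup_{x \in \mathbb{R}^{n}} \bigl( \langle x, y \rangle - f(x) \bigr)

% Fenchel-Young inequality, which holds for all x and y
f(x) + f^{*}(y) \;\ge\; \langle x, y \rangle

% Example: f(x) = \tfrac{1}{2}\lVert x \rVert_2^2 is its own conjugate
f^{*}(y) \;=\; \tfrac{1}{2}\lVert y \rVert_2^2
```

When f is differentiable, equality in the Fenchel-Young inequality holds exactly when y is the gradient of f at x, which is the usual reason this gap is used as a measure of mismatch between a prediction and a target.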
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
