An Information-Theoretic View for Deep Learning
Jingwei Zhang, Tongliang Liu, Dacheng Tao

TL;DR
This paper uses information theory to analyze why deep neural networks generalize well, showing that adding layers can reduce the expected generalization error exponentially under certain conditions.
Contribution
It derives an upper bound on generalization error based on mutual information and layer depth, providing theoretical insights into deep learning's effectiveness.
Findings
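The expected generalization error can decrease exponentially as the number of layers grows.
Convolutional layers with information loss reduce the generalization error of the whole network.
Deep networks satisfy a weak notion of stability, and their sample complexity decreases as depth grows.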
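One way to read the last finding, sketched under assumptions: if the bound quoted in the abstract, which simplifies to $\eta^{L/2}\sqrt{2\sigma^2 I(S,W)/n}$, is required to stay below a target $\epsilon$, rearranging gives $n \geq 2\sigma^2 I(S,W)\,\eta^{L}/\epsilon^2$. The Python sketch below evaluates that threshold. The function name `min_sample_size`, the target `epsilon`, the restriction $0 < \eta \le 1$, and the choice to hold $I(S,W)$ fixed as depth changes are assumptions made for illustration, not statements from the paper.

```python
import math


def min_sample_size(num_layers, eta, sigma, mutual_info, epsilon):
    """Smallest integer n for which the bound is at most epsilon.

    Rearranges eta**(L/2) * sqrt(2 * sigma**2 * I / n) <= epsilon into
    n >= 2 * sigma**2 * I * eta**L / epsilon**2.

    Assumptions (not from the paper): 0 < eta <= 1, and the mutual
    information I(S, W) is treated as a fixed number independent of depth.
    """
    if not 0.0 < eta <= 1.0:
        raise ValueError("eta is assumed to lie in (0, 1]")
    if num_layers < 0 or mutual_info < 0 or epsilon <= 0:
        raise ValueError("num_layers and mutual_info must be >= 0, epsilon > 0")
    n_real = 2.0 * sigma**2 * mutual_info * eta**num_layers / epsilon**2
    # At least one sample is always needed, even when the threshold is below 1.
    return max(1, math.ceil(n_real))


# Illustrative values only: compare the threshold at a few depths.
for depth in (0, 2, 4):
    print(depth, min_sample_size(depth, eta=0.5, sigma=1.0,
                                 mutual_info=8.0, epsilon=0.25))
```

Under these assumptions the threshold shrinks by a factor of $\eta$ per added layer. This says nothing about the training error, which the bound does not control.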
Abstract
Deep learning has transformed computer vision, natural language processing, and speech recognition \cite{badrinarayanan2017segnet, dong2016image, ren2017faster, ji20133d}. However, two critical questions remain obscure: (1) why do deep neural networks generalize better than shallow networks; and (2) does it always hold that a deeper network leads to better performance? Specifically, letting $L$ be the number of convolutional and pooling layers in a deep neural network, and $n$ be the size of the training sample, we derive an upper bound on the expected generalization error for this network, i.e., \begin{eqnarray*} \mathbb{E}[R(W)-R_S(W)] \leq \exp{\left(-\frac{L}{2}\log{\frac{1}{\eta}}\right)}\sqrt{\frac{2\sigma^2}{n}I(S,W) } \end{eqnarray*} where $\sigma$ is a constant depending on the loss function, $\eta$ is a constant depending on the information loss for each…
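To make the shape of the bound concrete, the following Python sketch evaluates its right-hand side for given values of the depth, the per-layer constant, the loss constant, the sample size, and the mutual information $I(S,W)$. The function name `generalization_bound`, its argument names, and the restriction $0 < \eta \le 1$ are assumptions for illustration; the paper does not prescribe an implementation, and in practice $I(S,W)$ and $\eta$ would have to be estimated rather than supplied.

```python
import math


def generalization_bound(num_layers, eta, sigma, n, mutual_info):
    """Evaluate exp(-(L/2) * log(1/eta)) * sqrt(2 * sigma**2 * I(S,W) / n).

    Assumption (not stated in the excerpt): 0 < eta <= 1, so that the
    depth factor does not grow with the number of layers.
    """
    if not 0.0 < eta <= 1.0:
        raise ValueError("eta is assumed to lie in (0, 1]")
    if n <= 0 or mutual_info < 0 or num_layers < 0:
        raise ValueError("n must be > 0; mutual_info and num_layers must be >= 0")
    # Depth-dependent factor; algebraically equal to eta ** (num_layers / 2).
    depth_factor = math.exp(-(num_layers / 2.0) * math.log(1.0 / eta))
    # Depth-independent factor built from the loss constant, I(S,W), and n.
    base_term = math.sqrt(2.0 * sigma**2 * mutual_info / n)
    return depth_factor * base_term


# Illustrative values only: the bound at increasing depth, other inputs fixed.
for depth in (0, 2, 4, 8):
    print(depth, generalization_bound(depth, eta=0.5, sigma=1.0,
                                      n=1000, mutual_info=10.0))
```

With every other input held fixed, each pair of added layers multiplies the value by $\eta$, which is the exponential decay in $L$ that the bound expresses.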
Taxonomy
Topics: Machine Learning and Algorithms · Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning
