An In-depth Investigation of Sparse Rate Reduction in Transformer-like Models
Yunzhe Hu, Difan Zou, Dong Xu

TL;DR
This paper investigates the Sparse Rate Reduction (SRR) objective in Transformer-like models, analyzing its implementation, relationship to generalization, and potential as a regularizer, supported by theoretical and empirical evidence.
Contribution
It provides a detailed analysis of SRR's implementations, demonstrates its correlation with generalization, and shows how SRR can be used to improve model performance through regularization.
Findings
SRR positively correlates with generalization performance.
As a complexity measure, SRR outperforms baselines such as path-norm and sharpness.
Using SRR as regularization improves image classification results.
Abstract
Deep neural networks have long been criticized for being black-box. To unveil the inner workings of modern neural architectures, a recent work \cite{yu2024white} proposed an information-theoretic objective function called Sparse Rate Reduction (SRR) and interpreted its unrolled optimization as a Transformer-like model called Coding Rate Reduction Transformer (CRATE). However, the focus of the study was primarily on the basic implementation, and whether this objective is optimized in practice and its causal relationship to generalization remain elusive. Going beyond this study, we derive different implementations by analyzing layer-wise behaviors of CRATE, both theoretically and empirically. To reveal the predictive power of SRR on generalization, we collect a set of model variants induced by varied implementations and hyperparameters and evaluate SRR as a complexity measure based on its…
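To make the objective named in the abstract concrete, here is a minimal NumPy sketch of a sparse rate reduction value for one batch of token representations. It assumes the form commonly given for the CRATE objective, R(Z) - R^c(Z; U) - λ‖Z‖_0, where R is a log-determinant coding rate and R^c sums the coding rates of Z projected onto K subspace bases. The function names, the default `eps` and `lam` values, and the use of an ℓ1 norm in place of ℓ0 are illustrative assumptions, not the paper's implementation.

```python
import numpy as np


def coding_rate(Z, eps=0.5):
    """Log-det coding rate R(Z) for Z of shape (d, n): d features, n tokens."""
    d, n = Z.shape
    gram = np.eye(d) + (d / (n * eps ** 2)) * (Z @ Z.T)
    # slogdet is numerically safer than log(det(...)); gram is positive definite.
    _, logdet = np.linalg.slogdet(gram)
    return 0.5 * logdet


def sparse_rate_reduction(Z, subspaces, lam=0.1, eps=0.5):
    """Sketch of R(Z) - R^c(Z; U) - lam * ||Z||_1.

    `subspaces` is a list of (d, p) basis matrices U_k (hypothetical inputs).
    The l1 norm is an assumed surrogate for the l0 sparsity term.
    """
    expansion = coding_rate(Z, eps)
    compression = sum(coding_rate(U.T @ Z, eps) for U in subspaces)
    sparsity = np.abs(Z).sum()
    return expansion - compression - lam * sparsity


# Toy usage with random inputs (illustrative only, not data from the paper).
rng = np.random.default_rng(0)
Z = rng.standard_normal((8, 16))
bases = [np.linalg.qr(rng.standard_normal((8, 2)))[0] for _ in range(4)]
value = sparse_rate_reduction(Z, bases)
```

Under these assumptions, a larger value means the representations are more spread out overall, more compressed within each subspace, and sparser. Evaluating a quantity like this layer by layer is one way to check whether a trained model actually optimizes the objective.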
Taxonomy
Topics: Image and Signal Denoising Methods
Methods: Attention Is All You Need · Dense Connections · Label Smoothing · Dropout · Linear Layer · Layer Normalization · Byte Pair Encoding · Adam · Residual Connection · Softmax
