Towards Better Generalization via Distributional Input Projection Network
Yifan Hao, Yanxin Lu, Hanning Zhang, Xinwei Shen, Tong Zhang

TL;DR
This paper introduces DIPNet, a novel neural network framework that projects inputs into learnable distributions at each layer, leading to smoother loss landscapes and improved generalization across diverse architectures and tasks.
Contribution
DIPNet is a new approach that enforces input smoothness via distributional projections, enhancing generalization in overparameterized models.
Findings
Theoretical analysis shows that DIPNet reduces both local smoothness measures and the Lipschitz constant of the network.
DIPNet improves test performance across various architectures.
DIPNet enhances robustness to adversarial and out-of-distribution inputs.
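For intuition on why noise-based projection can bound a Lipschitz constant, the standard Gaussian-smoothing argument is sketched below. This is the textbook result for a single bounded function, not the paper's own theorem; the symbols f, \hat f, \sigma, \delta and u are notation introduced here for illustration.

```latex
% Assumption: f : \mathbb{R}^d \to [0,1] is measurable and \sigma > 0.
\[
\hat f(x) = \mathbb{E}_{\delta \sim \mathcal{N}(0,\sigma^2 I)}\big[f(x+\delta)\big],
\qquad
\nabla \hat f(x) = \frac{1}{\sigma^2}\,\mathbb{E}_{\delta}\big[\delta\, f(x+\delta)\big].
\]
% For any unit vector u, using 0 \le f \le 1 and u^\top \delta \sim \mathcal{N}(0,\sigma^2):
\[
u^\top \nabla \hat f(x)
  \le \frac{1}{\sigma^2}\,\mathbb{E}\big[\,|u^\top \delta|\,\big]
  = \frac{1}{\sigma}\sqrt{\frac{2}{\pi}} .
\]
```

Under these assumptions the smoothed function is Lipschitz with constant at most \sqrt{2/\pi}/\sigma, regardless of how irregular f is, so larger noise scales give a smaller bound.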
Abstract
As overparameterized models become increasingly prevalent, training loss alone offers limited insight into generalization performance. While smoothness has been linked to improved generalization across various settings, directly enforcing smoothness in neural networks remains challenging. To address this, we introduce Distributional Input Projection Networks (DIPNet), a novel framework that projects inputs into learnable distributions at each layer. This distributional representation induces a smoother loss landscape with respect to the input, promoting better generalization. We provide theoretical analysis showing that DIPNet reduces both local smoothness measures and the Lipschitz constant of the network, contributing to improved generalization performance. Empirically, we validate DIPNet across a wide range of architectures and tasks, including Vision Transformers (ViTs), Large…
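A minimal sketch of the idea described in the abstract, assuming the per-layer distribution is an isotropic Gaussian centred on the layer input and that the output is averaged over k sampled forward passes (the reading given in the reviews). The function name `dip_forward`, the `log_sigmas` parameterization, and the toy MLP are hypothetical and are not taken from the paper.

```python
import numpy as np

def dip_forward(x, weights, log_sigmas, k=8, rng=None):
    """Average k noisy forward passes through a small ReLU MLP (illustrative only)."""
    rng = np.random.default_rng(0) if rng is None else rng
    outputs = []
    for _ in range(k):
        h = x
        for i, (W, log_sigma) in enumerate(zip(weights, log_sigmas)):
            # Replace the layer input with a sample from N(h, sigma^2 I);
            # log_sigma stands in for a parameter that would be learned in training.
            h = h + np.exp(log_sigma) * rng.standard_normal(h.shape)
            h = h @ W
            if i < len(weights) - 1:
                h = np.maximum(h, 0.0)  # ReLU on hidden layers only
        outputs.append(h)
    # Monte Carlo estimate of the expected output over the sampled trajectories.
    return np.mean(outputs, axis=0)

rng = np.random.default_rng(0)
weights = [0.5 * rng.standard_normal((4, 16)), 0.5 * rng.standard_normal((16, 3))]
log_sigmas = [np.log(0.1), np.log(0.1)]
x = rng.standard_normal((5, 4))
y = dip_forward(x, weights, log_sigmas, k=16, rng=rng)
print(y.shape)
```

Each call costs k forward passes, which is the source of the computational overhead that the reviews raise.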
Peer Reviews
Decision·Submitted to ICLR 2026
- The per-layer Gaussian projection with k-trajectory averaging integrates cleanly; the implementation steps are clearly stated.
- Proofs that smoothing can bound the Lipschitz constant and reduce second-order smoothness support the generalization narrative (I have not fully verified the proofs).
- The paper includes comprehensive setups and supportive ablation studies.
- The paper is poorly written and needs reorganization. Please add informative captions to all tables/figures and avoid pasting raw W&B screenshots; re-plot with consistent styling and legible axes/legend.
- The method is computationally expensive because it requires m forward passes per example.
- Reported fine-tuned results appear lower than widely reported pretrained baselines on GSM8K (e.g., Qwen2.5-3B ≈ 79.1; Llama-3.1-8B ≈ 84.5, per the Qwen 2.5 paper).
- Marginal improvements over other simpl
1. Comprehensive experiments across state-of-the-art vision and language models.
2. Strong theoretical grounding linking distributional projection to smoothness and generalization.
3. Improves not only standard generalization but also robustness to adversarial, OOD, and reasoning benchmarks.
1. Although motivated by smoothness, the intuition behind why distributional projection helps over simpler regularization is not fully disentangled.
2. The method introduces substantial computational overhead, and its effectiveness appears to rely heavily on distillation, raising concerns about efficiency and practicality in large-scale training.
The paper is generally well-written and easy to follow. The authors run experiments on multiple architectures and tasks (MLPs, CNN/ViT, a language model), indicating an effort toward broader validation. Some empirical gains are visible, suggesting the idea could have regularization benefits. The attempt to connect generalization behavior to smoothness properties is conceptually aligned with robust learning literature.
1. Misrepresentation of randomized smoothing literature. The manuscript repeatedly refers to “random smoothing” and incorrectly attributes adversarial training to Cohen et al. (2019). Cohen et al. established Gaussian randomized smoothing certificates using Neyman–Pearson and did not perform adversarial training. Salman et al. later connected smoothing to Lipschitz control, but this distinction is blurred or incorrect in multiple places. Example: Line 239: “and adversarial training (Cohen et al.
Taxonomy
Topics: Neural Networks and Applications
