Towards Better Generalization via Distributional Input Projection Network

Yifan Hao; Yanxin Lu; Hanning Zhang; Xinwei Shen; Tong Zhang

arXiv:2506.04690·cs.LG·September 30, 2025

Open Access·3 Reviews

TL;DR

This paper introduces DIPNet, a novel neural network framework that projects inputs into learnable distributions at each layer, leading to smoother loss landscapes and improved generalization across diverse architectures and tasks.

Contribution

DIPNet is a new approach that enforces input smoothness via distributional projections, enhancing generalization in overparameterized models.
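The idea of a per-layer distributional projection can be illustrated with a small sketch. The reviews further down describe the projection as Gaussian with k-trajectory averaging; everything else here is an assumption made for illustration and not the authors' implementation: the helper names (`project`, `dip_forward`), the log-sigma parameterization of the noise scale, and the tiny hand-written MLP. In a real model the `log_sigmas` values would be trained along with the weights; here they are fixed numbers.

```python
import math
import random


def linear(W, b, x):
    # Dense layer: W is a list of rows, b a list of biases, x the input vector.
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(v):
    return [max(0.0, a) for a in v]


def project(h, log_sigma, rng):
    # Hypothetical projection: replace h with one sample from N(h, sigma^2 I).
    # sigma = exp(log_sigma) keeps the scale positive (an assumed parameterization).
    sigma = math.exp(log_sigma)
    return [a + sigma * rng.gauss(0.0, 1.0) for a in h]


def dip_forward(x, layers, log_sigmas, k, rng):
    # Run k stochastic forward passes ("trajectories") and average the outputs.
    # Noise is injected at the input of every layer, not only at the network input.
    total = None
    for _ in range(k):
        h = x
        for i, ((W, b), log_sigma) in enumerate(zip(layers, log_sigmas)):
            h = linear(W, b, project(h, log_sigma, rng))
            if i < len(layers) - 1:  # no activation after the last layer
                h = relu(h)
        total = h if total is None else [t + a for t, a in zip(total, h)]
    return [t / k for t in total]


# Toy 2-2-1 network with made-up weights.
layers = [
    ([[0.5, -1.0], [1.5, 0.25]], [0.0, 0.1]),
    ([[1.0, -2.0]], [0.05]),
]
smoothed = dip_forward([1.0, 2.0], layers, [-1.0, -1.0], k=16, rng=random.Random(0))
print(smoothed)
```

Each call costs k forward passes, which is the source of the computational overhead raised in the reviews. Driving every `log_sigma` toward a large negative value recovers the ordinary deterministic forward pass.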

Findings

01

DIPNet reduces the Lipschitz constant of networks.

02

DIPNet improves test performance across various architectures.

03

DIPNet enhances robustness to adversarial and out-of-distribution inputs.
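Finding 01 is in line with a standard property of Gaussian smoothing. The bound below is that textbook result for a single smoothing step, stated here only as background; it is not the paper's own theorem, and the symbols f, σ, ε, and u are introduced for illustration.

```latex
% Assumption: f : \mathbb{R}^d \to [-1, 1] is measurable; \sigma > 0 is the noise scale.
\hat f(x) = \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, \sigma^2 I)}\bigl[f(x + \varepsilon)\bigr],
\qquad
\nabla \hat f(x) = \frac{1}{\sigma^2}\,\mathbb{E}\bigl[\varepsilon\, f(x + \varepsilon)\bigr].

% For any unit vector u, the projection u^\top \varepsilon is distributed as \mathcal{N}(0, \sigma^2), so
\bigl|u^\top \nabla \hat f(x)\bigr|
\le \frac{1}{\sigma^2}\,\mathbb{E}\bigl|u^\top \varepsilon\bigr|
= \frac{1}{\sigma}\sqrt{\frac{2}{\pi}}.
```

So the smoothed function is Lipschitz with constant at most sqrt(2/π)/σ regardless of how irregular f is, and a larger σ gives a smaller constant.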

Abstract

As overparameterized models become increasingly prevalent, training loss alone offers limited insight into generalization performance. While smoothness has been linked to improved generalization across various settings, directly enforcing smoothness in neural networks remains challenging. To address this, we introduce Distributional Input Projection Networks (DIPNet), a novel framework that projects inputs into learnable distributions at each layer. This distributional representation induces a smoother loss landscape with respect to the input, promoting better generalization. We provide theoretical analysis showing that DIPNet reduces both local smoothness measures and the Lipschitz constant of the network, contributing to improved generalization performance. Empirically, we validate DIPNet across a wide range of architectures and tasks, including Vision Transformers (ViTs), Large…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 2·Confidence 3

Strengths

- The per-layer Gaussian projection with k-trajectory averaging integrates cleanly; the implementation steps are clearly stated.
- Proofs that smoothing can bound the Lipschitz constant and reduce second-order smoothness support the generalization narrative (I have not fully verified the proofs).
- The paper includes comprehensive setups and supportive ablation studies.

Weaknesses

- The paper is poorly written and needs reorganization. Please add informative captions to all tables/figures and avoid pasting raw W&B screenshots; re-plot with consistent styling and legible axes/legends.
- The method is computationally expensive, requiring m forward passes per example.
- Reported fine-tuned results appear lower than widely reported pretrained baselines on GSM8K (e.g., Qwen2.5-3B ≈ 79.1; Llama-3.1-8B ≈ 84.5, per the Qwen 2.5 paper).
- Marginal improvements over other simpl

Reviewer 02·Rating 4·Confidence 3

Strengths

1. Comprehensive experiments across state-of-the-art vision and language models.
2. Strong theoretical grounding linking distributional projection to smoothness and generalization.
3. Improves not only standard generalization but also robustness to adversarial, OOD, and reasoning benchmarks.

Weaknesses

1. Although motivated by smoothness, the intuition behind why distributional projection helps over simpler regularization is not fully disentangled.
2. The method introduces substantial computational overhead, and its effectiveness appears to rely heavily on distillation, raising concerns about efficiency and practicality in large-scale training.

Reviewer 03·Rating 2·Confidence 4

Strengths

The paper is generally well-written and easy to follow. The authors run experiments on multiple architectures and tasks (MLPs, CNN/ViT, a language model), indicating an effort toward broader validation. Some empirical gains are visible, suggesting the idea could have regularization benefits. The attempt to connect generalization behavior to smoothness properties is conceptually aligned with robust learning literature.

Weaknesses

1. Misrepresentation of randomized smoothing literature. The manuscript repeatedly refers to “random smoothing” and incorrectly attributes adversarial training to Cohen et al. (2019). Cohen et al. established Gaussian randomized smoothing certificates using Neyman–Pearson and did not perform adversarial training. Salman et al. later connected smoothing to Lipschitz control, but this distinction is blurred or incorrect in multiple places. Example: Line 239: “and adversarial training (Cohen et al.

Taxonomy

Topics: Neural Networks and Applications