Zero Generalization Error Theorem for Random Interpolators via Algebraic Geometry
Naoki Yoshida, Isao Ishikawa, Masaaki Imaizumi

TL;DR
This paper proves that in a teacher-student setting, the generalization error of random interpolators becomes zero once training samples surpass a geometric threshold, using algebraic geometry tools.
Contribution
It provides a theoretical proof that random interpolators can achieve zero generalization error based on geometric properties, advancing understanding of model generalization.
Findings
Generalization error becomes zero beyond a data threshold
Algebraic geometry characterizes interpolator geometry
Supports empirical observations of effective random interpolators
Abstract
We theoretically demonstrate that the generalization error of interpolators for machine learning models under teacher-student settings becomes 0 once the number of training samples exceeds a certain threshold. Understanding the high generalization ability of large-scale models such as deep neural networks (DNNs) remains one of the central open problems in machine learning theory. While recent theoretical studies have attributed this phenomenon to the implicit bias of stochastic gradient descent (SGD) toward well-generalizing solutions, empirical evidence indicates that it primarily stems from properties of the model itself. Specifically, even randomly sampled interpolators, which are parameters that achieve zero training error, have been observed to generalize effectively. In this study, under a teacher-student framework, we prove that the generalization error of randomly sampled…
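The threshold behavior described in the abstract can be illustrated in the simplest possible case. The sketch below is not the paper's construction: it assumes a linear teacher and student $f(w, x) = w^\top x$ with standard normal inputs, where the set of interpolators is $w_* + \mathrm{null}(X)$ and the generalization error equals $\|w - w_*\|^2$. The function name `random_interpolator_error` and the choice to sample an interpolator by projecting a Gaussian vector onto $\mathrm{null}(X)$ are assumptions made for illustration only.

```python
import numpy as np

def random_interpolator_error(n, d, rng):
    """Sample one random interpolator of a linear teacher and return its
    generalization error E[(w.x - w_star.x)^2] = ||w - w_star||^2 for
    x ~ N(0, I). Hypothetical helper, not taken from the paper."""
    w_star = rng.standard_normal(d)          # frozen teacher weights
    X = rng.standard_normal((n, d))          # n noiseless training inputs
    # Orthogonal projector onto null(X): I - X^+ X
    P_null = np.eye(d) - np.linalg.pinv(X) @ X
    # Assumed sampling scheme: teacher plus a Gaussian vector in null(X)
    w = w_star + P_null @ rng.standard_normal(d)
    # w fits the training data exactly, so it is an interpolator
    assert np.allclose(X @ w, X @ w_star)
    return float(np.sum((w - w_star) ** 2))

rng = np.random.default_rng(0)
d = 5
errors = {n: random_interpolator_error(n, d, rng) for n in (2, 4, 5, 8)}
for n, err in errors.items():
    print(f"n={n}, d={d}: generalization error = {err:.3e}")
```

In this toy setting the error is strictly positive while $n < d$ and drops to zero (up to floating-point round-off) once $n \ge d$, because $\mathrm{null}(X)$ is then trivial almost surely. The threshold $n = d$ is specific to this linear example and should not be read as the paper's general threshold.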
Peer Reviews
Decision·Submitted to ICLR 2026
The problem formulation addresses generalization - one of the cornerstones of modern machine learning. The approach is novel and interesting.
The resulting bounds appear to be vacuous for overparameterized models, even though analyzing the latter serves as the motivation in the introduction. To corroborate further, consider the simplest case where the teacher and student networks match each other and are given by $f(w, x) = w^\top x$, where $x, w \in \mathbb{R}^d$. Denote the frozen weight parameter of the teacher network by $w_*$. Take $x$ to be i.i.d. standard normal. Then, first of all, the only zero generalization error w
- The authors propose a model-based analysis of the generalization error of interpolators in a teacher-student setting. This offers an interesting perspective on generalization, showing that perfect generalization can be reached when there is structure in the data distribution and the model is compatible with it (in the sense that it can interpolate).
- The paper uses tools from algebraic geometry, suggesting new links between this field and statistical learning theory.
- The literature rev
*Main weaknesses:*
- The introduction puts an emphasis on models that "employ an excessive number of parameters". However, the proposed theory states that the strong sample complexity is bounded by the number of parameters of the student network. This bound seems to suggest that not too much overparameterization is allowed in order to obtain zero generalization error. Even in Theorem 6, the derived sample complexity seems to be $k = O(\sqrt{d_\Theta})$, which allows some but not arbitrary overp
- **Clear and Well-Written Presentation:** The paper is generally well-written, and the results are clean.
- **Innovative Theoretical Contribution:** The paper rigorously analyzes the generalization properties of interpolators using tools from algebraic geometry. The derived results are both elegant and insightful. This paper is solid.
- **Empirical Validation:** Experimental results on synthetic regression tasks and the MNIST dataset back up the theoretical findings, showing that the predicted
- **Strong & Restrictive Assumptions:** The analysis is carried out in a controlled, noiseless teacher–student setting, and the student model is assumed to be real analytic. In practical scenarios, these assumptions may not hold, e.g., the teacher model is a ReLU MLP.
- **Limited Applicability to General Machine Learning Settings:** The theoretical results depend on the assumption that the teacher and student models belong to the same parametric function class—more precisely, Assumption 2 requir
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Graph Neural Networks · Gaussian Processes and Bayesian Inference
