An Agnostic View on the Cost of Overfitting in (Kernel) Ridge Regression
Lijia Zhou, James B. Simon, Gal Vardi, Nathan Srebro

TL;DR
This paper investigates the impact of overfitting in kernel ridge regression by analyzing the ratio of test errors between interpolating and optimally-tuned models, providing insights into different overfitting regimes.
Contribution
It introduces an agnostic framework to quantify overfitting costs across various sample sizes and target functions, using a Gaussian universality approach.
Findings
Characterizes benign, tempered, and catastrophic overfitting regimes
Provides refined understanding of overfitting behavior in kernel ridge regression
Utilizes risk estimates based on task eigenstructure
Abstract
We study the cost of overfitting in noisy kernel ridge regression (KRR), which we define as the ratio between the test error of the interpolating ridgeless model and the test error of the optimally-tuned model. We take an "agnostic" view in the following sense: we consider the cost as a function of sample size for any target function, even if the sample size is not large enough for consistency or the target is outside the RKHS. We analyze the cost of overfitting under a Gaussian universality ansatz using recently derived (non-rigorous) risk estimates in terms of the task eigenstructure. Our analysis provides a more refined characterization of benign, tempered and catastrophic overfitting (cf. Mallinar et al. 2022).
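The ratio defined in the abstract can be estimated empirically on a toy problem. The sketch below is only an illustration, not the paper's method: the RBF kernel, the sine target, the noise level, the ridge grid, and the helper names (`rbf_kernel`, `krr_test_error`, `cost_of_overfitting`) are all assumptions made for this example, and test error is measured as mean squared error on held-out noisy labels. The paper itself works with closed-form risk estimates rather than simulation.

```python
import numpy as np

def rbf_kernel(A, B, bandwidth=1.0):
    # Gaussian (RBF) kernel matrix between the rows of A and the rows of B.
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def krr_test_error(K_train, y_train, K_test, y_test, ridge):
    # Solve (K + ridge * I) alpha = y; lstsq also covers ridge = 0,
    # where it returns the minimum-norm (ridgeless) solution.
    n = K_train.shape[0]
    alpha = np.linalg.lstsq(K_train + ridge * np.eye(n), y_train, rcond=None)[0]
    pred = K_test @ alpha
    return float(np.mean((pred - y_test) ** 2))

def cost_of_overfitting(n_train=100, n_test=2000, dim=5, noise_std=0.5, seed=0):
    # Hypothetical toy task: Gaussian inputs, sine target plus label noise.
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_train + n_test, dim))
    y = np.sin(X[:, 0]) + noise_std * rng.standard_normal(n_train + n_test)
    X_tr, X_te = X[:n_train], X[n_train:]
    y_tr, y_te = y[:n_train], y[n_train:]

    K_tr = rbf_kernel(X_tr, X_tr)
    K_te = rbf_kernel(X_te, X_tr)

    # Ridge 0 is the interpolating model; the grid (an assumption) stands in
    # for optimal tuning and includes 0, so the ratio is at least 1.
    ridges = np.concatenate(([0.0], np.logspace(-6, 2, 30)))
    errors = [krr_test_error(K_tr, y_tr, K_te, y_te, r) for r in ridges]

    interp_error = errors[0]
    best_error = min(errors)
    return interp_error / best_error, interp_error, best_error

if __name__ == "__main__":
    ratio, interp_error, best_error = cost_of_overfitting()
    print(f"interpolating: {interp_error:.4f}  tuned: {best_error:.4f}  ratio: {ratio:.3f}")
```

Tuning the ridge on the test set, as done here, mimics the "optimally-tuned" comparator in the definition; it is not a model-selection procedure. Rerunning with different `n_train` values gives a rough picture of how the ratio changes with sample size.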
Peer Reviews
Decision: ICLR 2024 poster
1. The work is well organized. It studies three types of overfitting: benign overfitting, tempered overfitting, and catastrophic overfitting separately.
2. The work proves matching upper and lower bounds. It gives a necessary and sufficient condition, dependent on the effective ranks of the covariance matrix, $\lim_{k\rightarrow \infty} k/r_k$, to determine whether the overfitting is benign, tempered, or catastrophic. This resolves an open problem in Mallinar et al. (2022).
3. The work provides
1. For people unfamiliar with the literature, some concepts are hard to understand, for example, “omniscient risk estimate” and the “cost of overfitting.” Can you provide more intuition about these parts?
2. This work focuses on the specific problem of linear ridge regression. The analysis highly depends on the concrete structure of this problem. It might be hard to generalize to other problems of interest.
3. For a purely theoretical paper, this work provides its main results directly, witho
The paper asks an interesting question and formalizes it nicely. Understanding the worst-case values of this $\text{ratio}$ across all conditional distributions $y|x$ is a good framework. The question carries sufficient significance in producing a fairly thorough understanding of a fundamental ML task (KRR) with and without regularization. The paper makes a good effort in making the story of the results pretty clear -- we start with a naturally interesting problem, formalize it mat
Overview: The paper has some presentation and significance issues. I'll give my opinions and thoughts at the top of this box, and back it up with evidence at the bottom of this box. For significance, the worst-case bounds achieved in this agnostic model are tight only in unrealistic cases. Specifically, the worst-case ratio $\mathcal{E}_0$ should only be representative of the $\text{ratio}$ for a real ML task if the Bayes-Optimal estimator is the always-zero function. It's understandable an
The authors offer an intricate and comprehensive examination of closed-form risk estimates in kernel ridge regression (KRR), along with a nuanced analysis of the conditions that lead to overfitting being benign, tempered, or catastrophic. The utilization of Mercer’s decomposition and the application of bi-criterion optimization within the KRR framework are particularly notable aspects of the study. Additionally, the paper is well-organized, presenting its complex ideas in a coherent structure. T
The paper maintains a highly specialized focus, concentrating predominantly on kernel ridge regression. This narrow scope raises questions about the generalizability of its findings to other model types, such as kernel SVM, which may not align precisely with the conditions and scenarios discussed. Despite its comprehensive and in-depth theoretical analysis, a notable limitation is the absence of empirical validation. The inclusion of studies utilizing synthetic or real-world data to substantiate
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Sparse and Compressive Sensing Techniques · Statistical Methods and Inference
