Dimension Agnostic Testing of Survey Data Credibility through the Lens of Regression

Debabrota Basu; Sourav Chakraborty; Debarshi Chanda; Buddha Dev Das; Arijit Ghosh; Arnab Ray

arXiv:2508.20616·cs.LG·August 29, 2025

TL;DR

This paper introduces a dimension-agnostic, model-specific method for assessing survey data credibility through regression, achieving sample efficiency independent of data dimension and outperforming model reconstruction approaches.

Contribution

The paper presents a novel, sample-efficient algorithm for survey credibility testing that is independent of data dimension, focusing on verification rather than model reconstruction.

Findings

01 Algorithm's sample complexity is independent of data dimension.

02 Verification approach outperforms model reconstruction in efficiency.

03 Theoretical proof and numerical validation confirm effectiveness.

Tables

Table 1: Sufficient Size of Survey Data

Hypothesis Class ($\mathcal{F}$)	$\mathcal{F}_{1}$ (Lasso)	$\mathcal{F}_{2}$ (Ridge)	$\mathcal{F}_{\mathbf{K}}$ (Kernel)
Size of $S$ ($\mathcal{M}(\mathcal{F})$)	$\Omega(\frac{\log(d)}{\epsilon^{2}})$	$\Omega(\frac{d}{\epsilon^{2}})$	$\Omega(\frac{r^{2}}{\epsilon^{2}})$

Note: $r^{2}$ is the upper bound of $\|\mathbf{K}(\mathbf{x},\mathbf{x}^{\prime})\|$.

Table 2: Sufficient Size of Survey Data

Hypothesis Class ($\mathcal{F}$)	$\mathcal{F}_{1}$ (Lasso)	$\mathcal{F}_{2}$ (Ridge)	$\mathcal{F}_{\mathbf{K}}$ (Kernel)
Size of $S$ ($\mathcal{M}(\mathcal{F})$)	$\Omega(\frac{\log(d)}{\epsilon^{2}})$	$\Omega(\frac{d}{\epsilon^{2}})$	$\Omega(\frac{r^{2}}{\epsilon^{2}})$

Table 3: Performance of SurVerify on Synthetic Data w.r.t. $\mathcal{F}_{2}$ [Figure 6]

$\epsilon$	FDD	Acceptance Rate	Avg. # Samples Used	$\tau$	Early Rejection Ratio
0.05	0.04	1	818	818	1
0.05	0.04	1	818	818	1
0.05	0.05	1	818	818	1
0.05	0.06	1	818	818	1
0.05	0.07	1	818	818	1
0.05	0.09	1	818	818	1
0.05	0.11	0.96	818	818	1
0.05	0.14	0.74	818	818	1
0.05	0.18	0.06	811	818	0.991
0.05	0.21	0	758	818	0.927
0.05	0.25	0	575	818	0.703
0.05	0.29	0	390	818	0.477
0.05	0.33	0	277	818	0.339
0.05	0.39	0	197	818	0.241
0.05	0.45	0	135	818	0.165
0.05	0.51	0	100	818	0.122
0.05	0.57	0	83	818	0.101
0.05	0.63	0	64	818	0.079
0.05	0.70	0	52	818	0.063
0.05	0.78	0	44	818	0.053
0.05	0.86	0	28	818	0.035
0.05	0.94	0	29	818	0.035
0.05	1.02	0	27	818	0.034
0.05	1.12	0	19	818	0.024
0.05	1.20	0	18	818	0.022
0.05	1.31	0	16	818	0.019
0.05	1.43	0	13	818	0.016
0.05	1.54	0	10	818	0.013
0.05	1.62	0	10	818	0.012
0.05	1.76	0	10	818	0.012
0.05	1.88	0	9	818	0.011

Table 4: Performance of SurVerify on Synthetic Data w.r.t. $\mathcal{F}_{1}$ [Figure 6]

$\epsilon$	FDD	Acceptance Rate	Avg. # Samples Used	$\tau$	Early Rejection Ratio
0.05	0.03	1	818	818	1
0.05	0.04	1	818	818	1
0.05	0.04	1	818	818	1
0.05	0.05	1	818	818	1
0.05	0.07	1	818	818	1
0.05	0.09	1	818	818	1
0.05	0.11	1	818	818	1
0.05	0.13	0.96	818	818	1
0.05	0.16	0.66	818	818	1
0.05	0.20	0.1	809	818	0.989
0.05	0.24	0	726	818	0.888
0.05	0.28	0	506	818	0.618
0.05	0.33	0	342	818	0.418
0.05	0.38	0	225	818	0.276
0.05	0.43	0	177	818	0.216
0.05	0.50	0	122	818	0.150
0.05	0.56	0	89	818	0.109
0.05	0.62	0	68	818	0.083
0.05	0.70	0	54	818	0.066
0.05	0.78	0	51	818	0.063
0.05	0.85	0	36	818	0.044
0.05	0.93	0	27	818	0.032
0.05	1.02	0	21	818	0.026
0.05	1.10	0	21	818	0.025
0.05	1.21	0	16	818	0.020
0.05	1.31	0	14	818	0.017
0.05	1.43	0	14	818	0.017
0.05	1.51	0	13	818	0.016
0.05	1.63	0	11	818	0.013
0.05	1.75	0	9	818	0.011
0.05	1.87	0	9	818	0.011

Table 5: Performance of SurVerify on ACS_Income w.r.t. $\mathcal{F}_{2}$ [Figure 7]

$\epsilon$	FDD	Acceptance Rate	Avg. # Samples Used	$\tau$	Early Rejection Ratio
0.05	0.0258	1	2195	2195	1.000
0.04	0.0258	0.98	3361	3430	0.980
0.03	0.0258	0.98	5975	6097	0.980
0.02	0.0258	1	13717	13717	1.000
0.0175	0.0258	1	17916	17916	1.000
0.015	0.0258	0.94	24386	24386	1.000
0.0125	0.0258	0.52	35115	35115	0.933
0.01	0.0258	0.12	53222	54867	0.858
0.009	0.0258	0	57022	67737	0.571
0.0075	0.0258	0	61388	97542	0.326
0.006	0.0258	0	52710	152409	0.200
0.005	0.0258	0	47414	219468	0.136
0.004	0.0258	0	48322	342919	0.072
0.003	0.0258	0	45773	609633	0.031
0.002	0.0258	0	44064	1371675	0.017
0.0015	0.0258	0	47085	2438532	0.008
0.001	0.0258	0	44301	5486697	0.007
0.0005	0.0258	0	42207	21946787	0.002

Table 6: Performance of SurVerify on ACS_Income w.r.t. $\mathcal{F}_{1}$ [Figure 8]

$\epsilon$	FDD	Acceptance Rate	Avg. # Samples Used	$\tau$	Early Rejection Ratio
0.05	0.0259	1	2285	2285	1.000
0.04	0.0259	1	3570	3570	1.000
0.03	0.0259	0.98	6219	6346	0.980
0.02	0.0259	0.98	14279	14279	1.000
0.0175	0.0259	1	18650	18650	1.000
0.015	0.0259	0.98	25384	25384	1.000
0.0125	0.0259	0.76	36551	36553	1.000
0.01	0.0259	0.08	53262	57114	0.933
0.009	0.0259	0	60467	70511	0.858
0.0075	0.0259	0	57999	101535	0.571
0.006	0.0259	0	51660	158649	0.326
0.005	0.0259	0	45721	228454	0.200
0.004	0.0259	0	48408	356959	0.136
0.003	0.0259	0	45663	634593	0.072
0.002	0.0259	0	43893	1427833	0.031
0.0015	0.0259	0	42993	2538370	0.017
0.001	0.0259	0	41880	5711332	0.007
0.0005	0.0259	0	43961	22845325	0.002

Equations

$$f_{S}\triangleq\mathop{\mathrm{argmin}}_{f\in\mathcal{F}}\frac{1}{m}\sum_{(\mathbf{x},y)\in S}\mathcal{L}\left(f(\mathbf{x}),y\right).$$

$$f^{*}\triangleq\mathop{\mathrm{argmin}}_{f\in\mathcal{F}}\operatorname*{\mathbb{E}}_{(\mathbf{x},y)\sim\mathcal{D}^{*}}\mathcal{L}\left(f(\mathbf{x}),y\right).$$

$$\texttt{dist}_{\mathcal{D}}(f,g)\triangleq\sqrt{\operatorname*{\mathbb{E}}_{\mathbf{x}\sim\mathcal{D}}\left[(f(\mathbf{x})-g(\mathbf{x}))^{2}\right]}$$

$$y=\left\langle\boldsymbol{\theta}^{*},\mathbf{x}\right\rangle+\eta,\quad\text{where }\boldsymbol{\theta}^{*},\mathbf{x}\in\mathbb{R}^{d},\ y,\eta\in\mathbb{R}$$

$$y=\left\langle\boldsymbol{\theta}^{*},\mathbf{\Phi}(\mathbf{x})\right\rangle+\eta,\quad\text{where }\boldsymbol{\theta}^{*},\mathbf{\Phi}(\mathbf{x})\in\mathbb{H},\ y,\eta\in\mathbb{R}$$

$$\widehat{\Re}_{S}(\mathcal{G})\triangleq\operatorname*{\mathbb{E}}_{\mathbf{r}}\left[\sup_{g\in\mathcal{G}}\frac{1}{m}\sum_{i\in[m]}r_{i}g(z_{i})\right],$$

$$\texttt{FDD}_{\mathcal{D}^{*}}^{\mathcal{F}}(\mathcal{D}_{1},\mathcal{D}_{2})=\texttt{dist}_{\mathcal{D}^{*}}\left(f_{\mathcal{D}_{1}},f_{\mathcal{D}_{2}}\right).$$

$$\operatorname*{\mathbb{E}}_{(\mathbf{x},y)\sim\mathcal{D}^{*}}\left[\left(f_{S}(\mathbf{x})-y\right)^{2}\right]=\left(\texttt{FDD}_{\mathcal{D}^{*}}^{\mathcal{F}}(\mathcal{D}_{S},\mathcal{D}^{*})\right)^{2}+\sigma_{\eta}^{2}.$$

$$\operatorname*{\mathbb{E}}_{(\mathbf{x},y)\sim\mathcal{D}}\left[\mathcal{L}(f(\mathbf{x}),y)\right]-\frac{1}{m}\sum_{(\mathbf{x},y)\in S}\mathcal{L}(f(\mathbf{x}),y)\leq 2\mu\widehat{\Re}_{S}(\mathcal{F})+3M\sqrt{\frac{\log\frac{4}{\delta}}{2m}}$$

$$\sigma_{\eta}^{2}-\frac{1}{m}\sum_{(\mathbf{x},y)\in S}\left(f_{S}(\mathbf{x})-y\right)^{2}\leq\frac{\epsilon}{10},$$

$$\begin{aligned}
\operatorname*{\mathbb{E}}_{(\mathbf{x},y)\sim\mathcal{D}^{*}}\left[(f_{S}(\mathbf{x})-y)^{2}\right]
&=\operatorname*{\mathbb{E}}_{\mathcal{D}^{*},\eta}\left[(f_{S}(\mathbf{x})-f^{*}(\mathbf{x})-\eta)^{2}\right]\\
&=\operatorname*{\mathbb{E}}_{\mathcal{D}^{*},\eta}\left[(f_{S}(\mathbf{x})-f^{*}(\mathbf{x}))^{2}+\eta^{2}-2(f_{S}(\mathbf{x})-f^{*}(\mathbf{x}))\eta\right]\\
&=\operatorname*{\mathbb{E}}_{\mathcal{D}^{*}}\left[(f_{S}(\mathbf{x})-f^{*}(\mathbf{x}))^{2}\right]+\operatorname*{\mathbb{E}}_{\eta}\left[\eta^{2}\right]-2\operatorname*{\mathbb{E}}_{\mathcal{D}^{*}}\operatorname*{\mathbb{E}}_{\eta}\left[(f_{S}(\mathbf{x})-f^{*}(\mathbf{x}))\eta\right]\\
&=\operatorname*{\mathbb{E}}_{\mathcal{D}^{*}}\left[(f_{S}(\mathbf{x})-f^{*}(\mathbf{x}))^{2}\right]+\sigma_{\eta}^{2}-2\operatorname*{\mathbb{E}}_{\mathcal{D}^{*}}\left[(f_{S}(\mathbf{x})-f^{*}(\mathbf{x}))\operatorname*{\mathbb{E}}_{\eta}[\eta]\right]\\
&=\operatorname*{\mathbb{E}}_{\mathcal{D}^{*}}\left[(f_{S}(\mathbf{x})-f^{*}(\mathbf{x}))^{2}\right]+\sigma_{\eta}^{2}\\
&=\texttt{dist}_{\mathcal{D}^{*}}^{2}(f_{S},f^{*})+\sigma_{\eta}^{2}
\end{aligned}$$

$$\widehat{\Re}_{S}(\mathcal{L}\circ\mathcal{F})\leq\mu\widehat{\Re}_{S}(\mathcal{F})$$

$$\left|f(x_{1},\dots,x_{i},\dots,x_{m})-f(x_{1},\dots,x_{i}^{\prime},\dots,x_{m})\right|\leq c,\quad\forall x_{1},\dots,x_{m},x_{i}^{\prime}\in\mathbb{R}^{d}$$

$$\Pr\left[\left|f(X_{1},\dots,X_{m})-\operatorname*{\mathbb{E}}\left[f(X_{1},\dots,X_{m})\right]\right|\geq\epsilon\right]\leq 2\exp\left(\frac{-2\epsilon^{2}}{mc^{2}}\right)$$

$$\Re_{m}(\mathcal{G})=\operatorname*{\mathbb{E}}_{S\sim\mathcal{D}^{m}}\left[\widehat{\Re}_{S}(\mathcal{G})\right]$$

$$\frac{1}{m}\sum_{i\in[m]}g(z_{i})-\operatorname*{\mathbb{E}}\left[g(z)\right]\leq 2\widehat{\Re}_{S}(\mathcal{G})+3M\sqrt{\frac{\log\frac{4}{\delta}}{2m}}$$

$$\Phi(S)=\sup_{g\in\mathcal{G}}\left(\widehat{\mathbb{E}}_{S}[g]-\mathbb{E}[g]\right)$$

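$$\operatorname*{\mathbb{E}}_{S}\left[\Phi(S)\right]=\ \cdots$$

[The remaining steps of this chain, a sequence of $=$ and $\leq$ relations, are not recoverable from the extracted text.]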

$$\Phi(S)-\Phi(S^{\prime})\leq\sup_{g\in\mathcal{G}}\left(\widehat{\mathbb{E}}_{S}[g]-\widehat{\mathbb{E}}_{S^{\prime}}[g]\right)=\sup_{g\in\mathcal{G}}\frac{g(z_{i})-g(z_{i}^{\prime})}{m}\leq\frac{M}{m}$$

$$\Pr\left[\Phi(S)-\operatorname*{\mathbb{E}}\left[\Phi(S)\right]\geq M\sqrt{\frac{\log\frac{4}{\delta}}{2m}}\right]\leq\exp\left(\frac{-2M^{2}m^{2}\log\frac{4}{\delta}}{2m^{2}M^{2}}\right)=\frac{\delta}{4}$$

$$\Pr\left[\frac{1}{m}\sum_{i\in[m]}g(z_{i})-\operatorname*{\mathbb{E}}\left[g(z)\right]\geq 2\Re(\mathcal{G})+M\sqrt{\frac{\log\frac{4}{\delta}}{2m}}\right]\leq\frac{\delta}{4}$$

$$\widehat{\Re}_{S}(\mathcal{G})-\widehat{\Re}_{S^{\prime}}(\mathcal{G})\leq\ \cdots$$

$$\Pr\left[\Re(\mathcal{G})-\widehat{\Re}_{S}(\mathcal{G})\geq M\sqrt{\frac{\log\frac{4}{\delta}}{2m}}\right]\leq\frac{\delta}{4}$$

$$\Pr\left[\frac{1}{m}\sum_{i\in[m]}g(z_{i})-\operatorname*{\mathbb{E}}\left[g(z)\right]\geq 2\widehat{\Re}_{S}(\mathcal{G})+3M\sqrt{\frac{\log\frac{4}{\delta}}{2m}}\right]\leq\frac{\delta}{2}$$

Taxonomy

Topics: Machine Learning and Algorithms · Distributed Sensor Networks and Detection Algorithms · Statistical Methods and Inference

Full text

Dimension Agnostic Testing of Survey Data Credibility through the Lens of Regression

Debabrota Basu

Équipe Scool, Univ. Lille, Inria,

CNRS, Centrale Lille, UMR 9189- CRIStAL

F-59000 Lille, France

Sourav Chakraborty

Indian Statistical Institute

Kolkata, India

Debarshi Chanda

Indian Statistical Institute

Kolkata, India

Buddha Dev Das

Indian Statistical Institute

Kolkata, India

Arijit Ghosh

Indian Statistical Institute

Kolkata, India

Arnab Ray

Indian Statistical Institute

Kolkata, India

Abstract

Assessing whether a sample survey credibly represents the population is a critical question for ensuring the validity of downstream research. Generally, this problem reduces to estimating the distance between two high-dimensional distributions, which typically requires a number of samples that grows exponentially with the dimension. However, depending on the model used for data analysis, the conclusions drawn from the data may remain consistent across different underlying distributions. In this context, we propose a task-based approach to assess the credibility of sampled surveys. Specifically, we introduce a model-specific distance metric to quantify this notion of credibility. We also design an algorithm to verify the credibility of survey data in the context of regression models. Notably, the sample complexity of our algorithm is independent of the data’s dimension. This efficiency stems from the fact that the algorithm focuses on verifying the credibility of the survey data rather than reconstructing the underlying regression model. Furthermore, we show that if one attempts to verify credibility by reconstructing the regression model, the sample complexity scales linearly with the dimensionality of the data. We prove the theoretical correctness of our algorithm and numerically demonstrate our algorithm’s performance.

1 Introduction

Socio-economic surveys are conducted globally to collect data on population characteristics for a variety of purposes, including demographic and economic analyses, educational planning, poverty assessments, exit poll evaluations, and measuring progress toward national goals (GFJC+, 11; KKM+, 21). The primary aim of many surveys is to support inference-driven analyses that uncover patterns to inform future research and policy decisions (HWB, 17; GoC, 24), as well as to monitor and evaluate the long-term impacts of various policies (BDI+, 20). These survey data serve as long-term benchmarks for validating research hypotheses (SD, 94; GoC, 24). Therefore, verifying the credibility of such survey data is essential to ensure the validity of downstream analyses.

Ideally, properly collected data should be a faithful representation of the population, and representative data should ensure the validity of subsequent research. However, in practice, survey data rarely reflect the population perfectly (Mau, 17; IK, 20). In the social sciences, it is rare to find large-scale surveys that do not employ stratified or multistage sampling techniques (GFJC+, 11; Loh, 21; Kal, 21). Moreover, these surveys are often carried out under logistical constraints.

Determining whether a collected sample accurately represents the population is a longstanding challenge in both statistics and computer science, often framed in the latter as the problem of measuring the closeness between two distributions (BFR+, 00; Can, 22). Closeness testing of this kind demands many samples in high dimensions, so verifying representativeness is inherently inefficient and resource-intensive. In many cases, data collectors do not even claim that their samples are representative. Nevertheless, such data are routinely used for population-level research. Naturally, this raises the question of how much trust one can place in the resulting analyses. The answer hinges on the “credibility” of the data. In this paper, we propose a principled approach to quantify the credibility of survey data, along with an efficient method for doing so.

A key observation is that if the goal is merely to ensure the validity of research conducted using the data, then verifying whether the data is fully representative of the population may be unnecessary—or even excessive. In such cases, traditional methods for assessing representativeness may be too rigid or resource-intensive to be practical. Specifically, if the analysis relies on a well-established class of inference tools, we should be able to certify that any conclusions drawn using these tools from the given survey are valid, regardless of whether the data perfectly mirrors the population.

One widely used and interpretable method for analyzing survey data is fitting a regression model. For example, BJ (08) utilizes data from the British Health and Lifestyle Survey (1984-1985) and its longitudinal follow-up in May 2003 to demonstrate a strong association between mortality and socio-economic status. Motivated by such applications, in this paper, we ask the following question:

Can we verify whether the conclusions drawn from a regression model fitted on a given survey dataset would yield similar results if applied to the entire population?

Conducting large-scale sample surveys is often complex and costly, which can result in compromised data quality. However, it is commonly assumed that collecting a small number of additional high-quality data points can help validate the overall dataset. Building on this idea, our approach to the question above leverages a small number of high-quality supplementary samples, alongside the original survey data, to assess the credibility of the survey in the context of regression models. The central objective is to develop an efficient algorithm that minimizes both computational cost and sample complexity (i.e., the number of additional samples required).

Problem Formulation. Typically, once a sampling-based study is designed, survey data is collected from an underlying population. In line with the structure of most socio-economic surveys, we assume that the survey dataset $S$ consists of tabular numeric covariates and a scalar response variable. Specifically, each data point in $S$ is of the form $(\mathbf{x},y)$, where the covariates $\mathbf{x}\in\mathbb{R}^{d}$ and the response variable $y\in\mathbb{R}$. In most surveys, the dimension $d$ is quite large.

We denote by $\mathcal{D}^{*}$ the distribution of the $(\mathbf{x},y)$ tuples over the whole population. If the dataset $S$ were obtained by perfect sampling, i.e. by drawing independent samples from the unknown distribution $\mathcal{D}^{*}$, then the survey data $S$ would be a credible representation of the population. Due to various limitations, however, the collected dataset $S$ may instead consist of samples drawn from some other distribution $\mathcal{D}_{S}$. The question of how credible $S$ is as a representation of the population therefore boils down to understanding the distance between the two distributions $\mathcal{D}^{*}$ and $\mathcal{D}_{S}$. We call $\mathcal{D}^{*}$ the true distribution and $\mathcal{D}_{S}$ the sample distribution. Estimating the distance between two high-dimensional distributions is very inefficient and hence impractical (Can, 15, 22). This has motivated the development of distance measures between datasets, such as the Optimal Transport Dataset Distance (AMF, 20), which are themselves costly to compute in high dimensions.

In particular, we list the sample complexities of some of the most well-studied distributional distances when the distributions are defined over a $d$ -dimensional space:

• TV: Testing the TV distance between two distributions over a support of size $k$ requires $\Theta(k/\log k)$ samples (CJKL, 22). If the distributions are over $\{0,1\}^{d}$, then $k=2^{d}$ and the sample complexity is $\Theta(2^{d}/d)$. For continuous distributions over $[0,1]^{d}$ discretized with bin width $\varepsilon$, we have $k=\varepsilon^{-d}$ and the sample complexity is $\Theta(\varepsilon^{-d}/(d\log(1/\varepsilon)))$.

• Wasserstein: For two bounded-moment distributions over a $d$-dimensional space, the Wasserstein distance requires $\Omega(\epsilon^{-d})$ samples for the empirical measure to converge to distance $\epsilon$ (Lei, 20).

• KL: For two distributions over a $d$-dimensional bounded space, the minimax-optimal estimation of the KL divergence requires $\Omega(\varepsilon^{-d})$ samples (ZL, 20).

In all the cases discussed above, testing closeness of distributions given sampling access to them requires a number of samples that grows exponentially with the dimension. In contrast, the number of samples-to-test required by our method is independent of the dimension.
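To make the contrast concrete, the following sketch evaluates the rates quoted above for a few dimensions. It is only illustrative: the constants hidden by the $\Theta(\cdot)$ and $\Omega(\cdot)$ notation are dropped, and the function names are our own rather than anything defined in the paper.

```python
import math


def tv_rate_hypercube(d: int) -> float:
    """Theta(k / log k) with k = 2**d support points (log base 2, constants dropped)."""
    k = 2 ** d
    return k / math.log2(k)  # equals 2**d / d


def tv_rate_binned(d: int, eps: float) -> float:
    """Same rate with k = (1/eps)**d bins on [0, 1]^d (natural log, constants dropped)."""
    return eps ** (-d) / (d * math.log(1.0 / eps))


def wasserstein_rate(d: int, eps: float) -> float:
    """Omega(eps**(-d)) lower bound (constants dropped)."""
    return eps ** (-d)


if __name__ == "__main__":
    # Each column grows exponentially in d for a fixed accuracy eps = 0.1.
    for d in (5, 10, 20, 40):
        print(d,
              f"{tv_rate_hypercube(d):.3g}",
              f"{tv_rate_binned(d, 0.1):.3g}",
              f"{wasserstein_rate(d, 0.1):.3g}")
```

Even at moderate dimension these quantities exceed any realistic validation budget, which is the motivation for a test whose sample requirement does not depend on $d$.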

Samples collected from a survey are typically used for various data interpretation and deduction tasks, e.g. regression and classification. In all these cases, one aims to find a model from a given model class, say $\mathcal{F}$, that minimises a task-specific loss function. For example, in regression, we aim to find the regression function that minimises the squared loss over the survey data. If ${\mathcal{L}}:{\mathbb{R}^{2}}\rightarrow{\mathbb{R}}$ is the loss function, then the model learnt from the survey set $S$ is:

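$$f_{S}\triangleq\mathop{\mathrm{argmin}}_{f\in\mathcal{F}}\frac{1}{m}\sum_{(\mathbf{x},y)\in S}\mathcal{L}\left(f(\mathbf{x}),y\right).$$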

To validate the credibility of survey data, we propose to test whether the model $f_{S}$ derived from the survey data $S$ matches the model $f^{*}$ that would have been obtained had the dataset $S$ been a credible representation of the population $\mathcal{D}^{*}$.

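$$f^{*}\triangleq\mathop{\mathrm{argmin}}_{f\in\mathcal{F}}\operatorname*{\mathbb{E}}_{(\mathbf{x},y)\sim\mathcal{D}^{*}}\mathcal{L}\left(f(\mathbf{x}),y\right).$$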

We will assume that we have access to a small sample set, called the validation dataset, obtained by drawing i.i.d. samples from the true distribution $\mathcal{D}^{*}$ .

Depending on the problems, different metrics have been proposed to quantify the closeness of distributions (GS, 02). Our goal is to validate the quality of the survey data $S$ by estimating the distance of $f_{S}$ from $f^{*}$ . We use the distributional $\ell_{2}$ distance to quantify the closeness of regression models.

Definition 1 (Distributional $\ell_{2}$ -Distance between Functions).

Let $f$ and $g$ be real-valued functions on $\mathbb{R}^{d}$ , and $\mathcal{D}$ be a distribution on $\mathbb{R}^{d}$ . The distributional $\ell_{2}$ -distance between $f$ and $g$ on $\mathcal{D}$ is:

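$$\texttt{dist}_{\mathcal{D}}(f,g)\triangleq\sqrt{\operatorname*{\mathbb{E}}_{\mathbf{x}\sim\mathcal{D}}\left[(f(\mathbf{x})-g(\mathbf{x}))^{2}\right]}$$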
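As a rough illustration of Definition 1, the sketch below estimates the distributional $\ell_{2}$-distance by Monte Carlo, taking the distance to be the square root of the expected squared difference (the form under which its square matches the squared term in the paper's loss decomposition). The helper names `dist_l2`, `uniform_cube` and `linear`, the uniform covariate distribution, and the two linear models are assumptions made purely for this example.

```python
import math
import random


def dist_l2(f, g, sample_x, n=20000, seed=0):
    """Monte Carlo estimate of sqrt(E_{x~D}[(f(x) - g(x))^2])."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = sample_x(rng)
        total += (f(x) - g(x)) ** 2
    return math.sqrt(total / n)


def uniform_cube(d):
    """Sampler for x ~ Uniform([-1, 1]^d), a stand-in for the distribution D."""
    return lambda rng: [rng.uniform(-1.0, 1.0) for _ in range(d)]


def linear(theta):
    """Linear model x -> <theta, x>."""
    return lambda x: sum(t * xi for t, xi in zip(theta, x))


d = 5
f = linear([0.2] * d)
g = linear([0.2] * (d - 1) + [0.0])  # differs from f only in the last coordinate
est = dist_l2(f, g, uniform_cube(d))
# Analytically, dist = 0.2 * sqrt(E[x_5^2]) = 0.2 / sqrt(3) for this choice.
print(est, 0.2 / math.sqrt(3))
```

Note that the estimate only needs samples of the covariates from $\mathcal{D}$ and evaluations of the two functions; it never estimates $\mathcal{D}$ itself.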

Thus, our problem can be formulated as follows: Given a survey set $S$ (drawn according to some unknown distribution $\mathcal{D}_{S}$ ) and a model class $\mathcal{F}$ , we aim to sample a small number of new data points from the true distribution $\mathcal{D}^{*}$ and determine whether $\texttt{dist}_{\mathcal{D}^{*}}(f_{S},f^{*})$ lies within a specified acceptable threshold. Ideally, the number of new samples drawn from $\mathcal{D}^{*}$ should be very small and independent of the dimensionality of the ambient space.

Related Works. Our work lies at the intersection of distribution testing and model validation. Distribution identity testing, i.e. determining whether an unknown distribution matches a known one, has been widely studied (BFR+, 00; Pan, 08; VV, 17; DGPP, 18), with comprehensive surveys summarizing key results (Can, 22, 15). Recent efforts have focused on high-dimensional settings, where testing structured distributions such as Ising models or Bayesian networks poses significant challenges (DP, 17; DDK, 18; CDKS, 17; BGMV, 20; BGKV, 21). However, these approaches often suffer from exponential sample complexity in the dimension $d$ (BBC+, 20; BCvV, 21). In contrast, model validation has long been studied through statistical tests for evaluating model fit, especially in regression and parametric models (Sne, 77; PC, 84; DM, 98; SZ, 21; Stu, 97). These approaches often rely on strong assumptions about the model or the data. Our work brings these two perspectives together, aiming to develop scalable and principled methods for validating the credibility of high-dimensional surveys through the lens of regression models.

Our Results. In this work, we consider the class of regression models for the model-specific testing problem. We adopt two common assumptions on regression models: exogenous noise in observation (RL, 03; MRT, 18) and boundedness of the involved variables and the model (MRT, 18; JWHT, 21). (Footnote 1: These two assumptions are not strictly necessary for the proposed framework to function; they are made to provide a clean and rigorous theoretical analysis. We discuss this further in Section 6.) Exogeneity of noise ensures exact identifiability of the underlying model, i.e. we do not have unidentified covariates that influence the outcome. Boundedness is usually satisfied in our setting, as survey datasets always have finite entries and can be normalized.

Assumption 1 (Exogenous Noise).

For a regression model $y=f(\mathbf{x})+\eta$ , we have:

(a) Homoskedasticity: The noise $\eta$ has constant variance, i.e. $\operatorname*{\mathrm{Var}}[\eta\mid\mathbf{x}]=\sigma_{\eta}^{2}$ ,

(b) Non-correlation: The noise $\eta$ is uncorrelated with $\mathbf{x}\in\mathbb{R}^{d}$ and independent across observations.

Assumption 2 (Boundedness).

We assume that the response variable satisfies $\left|{y}\right|\leq 1$, the covariates satisfy $\left\|{\mathbf{x}}\right\|_{\infty}\leq 1$, and $f(\mathbf{x})\leq 1$.

Given this context, we elaborate the main contributions of this paper:

Task-Specific Credibility Testing: We propose the framework of task-specific credibility testing of a survey, which checks whether the survey leads to valid inference when used with ML models. Specifically, we focus on regression models: linear with $\ell_{1}$ and $\ell_{2}$ regularizers, and kernel with $\ell_{2}$ regularizers. This deviates from classic distribution testing frameworks, which check some divergence (e.g. TV, KL, Wasserstein) between two data distributions. Those frameworks require a number of samples that is exponential in the dimension of the data, which is infeasible in a survey setting. Thus, we propose a new data-distribution-specific metric between two regression models, called the Functional Distance of Distributions (FDD), and leverage it to test closeness of two data distributions through the lens of regression.

Generic Algorithm for Model-Specific Testing for Regression Models: We propose SurVerify to test whether a regression model learned from given survey data $S$ is close to a model learned using independent and identically distributed (i.i.d.) samples collected from an underlying distribution. SurVerify does this by checking whether the loss of the survey-based model and that of the i.i.d. model match up to a pre-computed threshold. We prove that SurVerify is correct with high probability up to a user-defined tolerance gap. We show that the worst-case sample complexity of SurVerify (i.e. the number of samples-to-test that SurVerify needs from the true distribution $\mathcal{D}^{*}$) for conducting a correct test is independent of the dimension and fixed across regression models. Additionally, if the model is very far in the FDD metric, SurVerify detects it earlier, with fewer samples. Finally, we numerically verify the correctness and sample complexity of SurVerify across datasets.

To conduct our theoretical analysis, we propose a new two-sided bound on generalization error of a regression model, which is of independent interest for statistical learning.

Organization of the paper:

Section 2 introduces the preliminaries. Section 3 discusses the new metric. Section 4 presents our main algorithm, SurVerify, with theoretical guarantees. Proofs appear in the Appendix. Section 5 reports experimental results.

2 Preliminaries: Regression Models and Rademacher Complexity

The survey set is denoted by $S$, and its size by $m=\left|{S}\right|$. We denote by $\mathcal{X}$ and $\mathcal{Y}$ the input and output spaces, respectively. $\mathcal{F}$ denotes a hypothesis set consisting of hypotheses ${f}:{\mathcal{X}}\rightarrow{\mathcal{Y}}$; in our setting these are regression functions, and the coefficients associated with a regression function are denoted by $\boldsymbol{\theta}$. $\left\langle{\cdot}\,,\,{\cdot}\right\rangle$ denotes the inner product, and $\left\|{\cdot}\right\|_{p}$ denotes the $\ell_{p}$ norm.

A Primer on Regression: Linear and Kernel.

Performing regression on survey data to fit reasonable models over the population is central to a wide variety of analysis tasks (CGG+, 15; Pan, 17; MS, 17). Often, the observations collected to construct a survey dataset are the result of a complex sampling design reflecting the need to collect data as efficiently as possible within cost constraints.

Broadly, the problem of regression is as follows: given an input space $\mathcal{X}\subseteq\mathbb{R}^{d}$, an output range $\mathcal{Y}\subseteq\mathbb{R}$, a distribution over $\mathcal{X}\times\mathcal{Y}$, a hypothesis set $\mathcal{F}$ of functions from $\mathcal{X}$ to $\mathcal{Y}$, and a loss function ${\mathcal{L}}:{\mathcal{Y}\times\mathcal{Y}}\rightarrow{\mathbb{R}}$, output a hypothesis $h\in\mathcal{F}$ that minimizes the expected loss w.r.t. the distribution over $\mathcal{X}\times\mathcal{Y}$. Here, we consider the regression model with additive noise $\eta$, that is, $y=f(\mathbf{x})+\eta$.

In this work, we consider three widely used hypothesis classes for the regression problem. First, we consider linear regression that tries to fit a linear model between the response and the covariates, i.e.

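$$y=\left\langle\boldsymbol{\theta}^{*},\mathbf{x}\right\rangle+\eta,\quad\text{where }\boldsymbol{\theta}^{*},\mathbf{x}\in\mathbb{R}^{d},\ y,\eta\in\mathbb{R}$$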

We consider both the cases of $\ell_{1}$ and $\ell_{2}$ -norm bounded coefficients for the linear regression model, known as Lasso and Ridge regression respectively. These are also called the bounded weight hypothesis classes $\mathcal{F}_{p}=\{\mathbf{x}\rightarrow\left\langle{\boldsymbol{\theta}}\,,\,{\mathbf{x}}\right\rangle:\left\|{\boldsymbol{\theta}}\right\|_{p}\leq 1\}.$ Henceforward, we use these two terms interchangeably. We denote the hypothesis sets containing $\ell_{1}$ and $\ell_{2}$ bounded linear regressions as $\mathcal{F}_{1}$ , and $\mathcal{F}_{2}$ , respectively.

We also consider the Kernel Regression model, where we associate with the input space $\mathcal{X}$ a PDS (Positive Semidefinite Symmetric) kernel ${\mathbf{K}}:{\mathcal{X}\times\mathcal{X}}\rightarrow{\mathbb{R}}$ that implicitly defines an associated function ${\mathbf{\Phi}}:{\mathcal{X}}\rightarrow{\mathbb{H}}$ such that: $\mathbf{K}\left({\mathbf{x},\mathbf{x}^{\prime}}\right)=\left\langle{\mathbf{\Phi}(\mathbf{x})}\,,\,{\mathbf{\Phi}(\mathbf{x}^{\prime})}\right\rangle$ . The regression model is a linear model on this Hilbert space $\mathbb{H}$ with the underlying coefficients $\boldsymbol{\theta}^{*}\in\mathbb{H}$ , and the model is:

$$y=\left\langle\boldsymbol{\theta}^{*},\mathbf{\Phi}(\mathbf{x})\right\rangle+\eta,\quad\text{where }\boldsymbol{\theta}^{*},\mathbf{\Phi}(\mathbf{x})\in\mathbb{H},\ y,\eta\in\mathbb{R}$$

In this case, we consider the hypothesis class consisting of coefficients $\boldsymbol{\theta}$ with bounded $\mathbb{H}$-norm. We denote the resulting class of kernel regression models by $\mathcal{F}_{\mathbf{K}}$. For all the regression models, we take the loss function $\mathcal{L}$ to be the squared error loss, defined as $\mathcal{L}(y,y^{\prime})\triangleq(y-y^{\prime})^{2}$.

Rademacher Complexity.

The Rademacher complexity of a function class $\mathcal{G}$ plays a crucial role in the generalization bounds for several learning models MRT (18), and also in our analysis. The empirical Rademacher complexity is measured w.r.t. a particular set of samples $S$ .

Definition 2 (Empirical Rademacher Complexity).

Let $\mathcal{G}$ be a family of functions ${g}:{\mathcal{Z}}\rightarrow{\left[{0,M}\right]}$ and let $S=\left({z_{1},\ldots,z_{m}}\right)$ be a fixed sample with elements in $\mathcal{Z}$. The empirical Rademacher complexity $\widehat{\Re}_{S}\left({\mathcal{G}}\right)$ of $\mathcal{G}$ w.r.t. $S$ is

$$\widehat{\Re}_{S}(\mathcal{G})\triangleq\operatorname*{\mathbb{E}}_{\mathbf{r}}\left[\sup_{g\in\mathcal{G}}\frac{1}{m}\sum_{i\in[m]}r_{i}g(z_{i})\right],$$

where the $r_{i}$ are i.i.d. Rademacher random variables, each taking values uniformly in $\left\{{-1,+1}\right\}$.
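For the bounded-weight linear classes $\mathcal{F}_{1}$ and $\mathcal{F}_{2}$, the supremum inside the expectation has a closed form through the dual norm, so the empirical Rademacher complexity can be approximated by averaging over random sign vectors. The sketch below does this with the standard library only. It is an illustration rather than a procedure from the paper: the function names are hypothetical, the sample is synthetic, and linear functions take negative values, so the $[0,M]$ range in Definition 2 is not enforced here.

```python
import math
import random


def empirical_rademacher_linear(xs, p=2, trials=500, seed=0):
    """Monte Carlo estimate of E_r[sup_{||theta||_p <= 1} (1/m) sum_i r_i <theta, x_i>].

    For p = 2 the supremum equals ||sum_i r_i x_i||_2 / m (Cauchy-Schwarz);
    for p = 1 it equals ||sum_i r_i x_i||_inf / m (dual norm of l1).
    """
    if p not in (1, 2):
        raise ValueError("only p = 1 (Lasso class) and p = 2 (Ridge class) are sketched")
    rng = random.Random(seed)
    m, d = len(xs), len(xs[0])
    total = 0.0
    for _ in range(trials):
        r = [rng.choice((-1, 1)) for _ in range(m)]  # Rademacher signs
        v = [sum(r[i] * xs[i][j] for i in range(m)) for j in range(d)]
        if p == 2:
            sup = math.sqrt(sum(c * c for c in v))
        else:
            sup = max(abs(c) for c in v)
        total += sup / m
    return total / trials


def make_sample(m, d, seed=1):
    """Synthetic covariates in [-1, 1]^d, matching the boundedness assumption."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(m)]


xs = make_sample(m=100, d=5)
rad_l2 = empirical_rademacher_linear(xs, p=2)
rad_l1 = empirical_rademacher_linear(xs, p=1)
print(rad_l1, rad_l2)
```

Because both calls reuse the same seed, the two estimates are computed on identical sign vectors, which makes the comparison between the $\ell_{1}$ and $\ell_{2}$ classes direct.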

3 Functional Distance of Distributions (FDD): A Novel Metric

We define the model-specific distance between distributions that quantifies the distance between distributions w.r.t. a model class $\mathcal{F}$ and a true distribution $\mathcal{D}^{*}$ .

Definition 3 ( $\texttt{FDD}_{\mathcal{D}^{*}}^{\mathcal{F}}(\mathcal{D}_{1},\mathcal{D}_{2})$ ).

Given a true distribution $\mathcal{D}^{*}$ , a model class $\mathcal{F}$ , and an associated loss function $\mathcal{L}_{\mathcal{F}}$ , let $f_{\mathcal{D}_{1}}$ , and $f_{\mathcal{D}_{2}}$ be the optimal models in $\mathcal{F}$ for $\mathcal{D}_{1}$ , and $\mathcal{D}_{2}$ , respectively. We define the model-specific distance w.r.t. the true distribution $\mathcal{D}^{*}$ as:

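$$\texttt{FDD}_{\mathcal{D}^{*}}^{\mathcal{F}}(\mathcal{D}_{1},\mathcal{D}_{2})=\texttt{dist}_{\mathcal{D}^{*}}\left(f_{\mathcal{D}_{1}},f_{\mathcal{D}_{2}}\right).$$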

Given a true distribution $\mathcal{D}^{*}$ , the model specific testing transforms the problem of testing closeness of distributions to testing closeness of functions over a given true distribution. Given a hypothesis set $\mathcal{F}$ , and a loss function $\mathcal{L}$ , it associates with each distribution $\mathcal{D}$ a function $f_{\mathcal{D}}$ as $f_{\mathcal{D}}=\mathop{\mathrm{argmin}}_{f\in\mathcal{F}}\operatorname*{\mathbb{E}}_{\mathbf{x},y\sim\mathcal{D}}\mathcal{L}\left({f(\mathbf{x}),y}\right)$ .

Consequently, given a set of distributions $\mathbfcal{D}$ , we can define the set of hypotheses associated with them as $\mathcal{F}_{\mathbfcal{D}}=\left\{{f_{\mathcal{D}}\in\mathcal{F}\mid\mathcal{D}\in\mathbfcal{D}}\right\}$ .

By a standard fact of $\texttt{L}^{p}$ spaces SS (12), if the functions $f\in\mathcal{F}_{\mathbfcal{D}}$ have bounded second moment w.r.t. $\mathcal{D}^{*}$ , i.e. $\operatorname*{\mathbb{E}}_{(\mathbf{x},y)\sim\mathcal{D}^{*}}\left[{f^{2}(\mathbf{x})}\right]<\infty$ , then the set $\left({\mathcal{F}_{\mathbfcal{D}},\texttt{dist}_{\mathcal{D}^{*}}}\right)$ constitutes an $\texttt{L}^{2}$ -space. If we consider the equivalence relation $f_{1}\sim f_{2}$ if and only if $\texttt{dist}_{\mathcal{D}^{*}}(f_{1},f_{2})=0$ , then $\texttt{dist}_{\mathcal{D}^{*}}$ defines a metric on the resulting partition. Correspondingly, $\texttt{FDD}_{\mathcal{D}^{*}}^{\mathcal{F}}$ induces a metric on the partition of distributions $\mathbfcal{D}$ induced by the equivalence relation $\mathcal{D}_{1}\sim\mathcal{D}_{2}$ if and only if $\texttt{dist}_{\mathcal{D}^{*}}\left({f_{\mathcal{D}_{1}},f_{\mathcal{D}_{2}}}\right)=0$ .
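To make the construction concrete, here is a minimal sketch of how the model-specific distance could be estimated for unconstrained linear least squares. It assumes that FDD is the $\ell_{2}$ -distance between the two fitted models under $\mathbf{x}\sim\mathcal{D}^{*}$ (our reading of Definitions 1 and 3), and all helper names and the synthetic distributions are hypothetical. Two distributions with very different covariates but the same regression function end up at distance close to zero, while a shifted coefficient vector gives a clearly positive distance.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares coefficients (no intercept), standing in for argmin over F."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

def fdd_estimate(theta1, theta2, X_star):
    """Monte Carlo l2-distance between two linear models under x ~ D*."""
    diff = X_star @ (theta1 - theta2)
    return float(np.sqrt(np.mean(diff ** 2)))

rng = np.random.default_rng(0)
d, n = 5, 20000
theta = rng.normal(size=d)

# D1 and D2: different covariate distributions, same regression function.
X1 = rng.normal(0.0, 1.0, size=(n, d))
X2 = rng.uniform(-3.0, 3.0, size=(n, d))
y1 = X1 @ theta + rng.normal(0.0, 0.1, size=n)
y2 = X2 @ theta + rng.normal(0.0, 0.1, size=n)

# D3: same covariates as D1 but every coefficient shifted by 0.5.
y3 = X1 @ (theta + 0.5) + rng.normal(0.0, 0.1, size=n)

X_star = rng.normal(0.0, 1.0, size=(n, d))  # covariates drawn from D*
fdd_same = fdd_estimate(fit_linear(X1, y1), fit_linear(X2, y2), X_star)
fdd_diff = fdd_estimate(fit_linear(X1, y1), fit_linear(X1, y3), X_star)
print(fdd_same, fdd_diff)
```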

It is important to note that the FDD metric can be zero even when the distributions $\mathcal{D}_{1}$ and $\mathcal{D}_{2}$ differ significantly. Therefore, when the goal is to assess whether two distributions are equivalent with respect to a specific task, FDD serves as an appropriate measure. For regression models that satisfy the exogenous noise assumption (Assumption 1) under the squared loss, we establish the following relationship between the loss and FDD.

Lemma (FDD-variance Decomposition of Loss).

If the model class $\mathcal{F}$ satisfies Assumption 1, then

[TABLE]

The above lemma can be intuitively viewed as a decomposition result, akin to the classic bias-variance breakdown of estimation error Was (04). It states that the expected loss of a model learned from the survey, evaluated with respect to the true distribution, can be decomposed into two components: the approximation error (i.e., how far the learned model is from the optimal one) and the intrinsic noise (i.e., the error incurred even by the best possible model).
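The decomposition can be checked numerically. The sketch below assumes the reading given in the paragraph above: under additive, zero-mean noise independent of $\mathbf{x}$ , the expected squared loss of any fixed model on $\mathcal{D}^{*}$ equals its squared $\ell_{2}$ -distance to the optimal model plus the noise variance. The particular models, dimension, and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 200_000
theta_star = rng.normal(size=d) / np.sqrt(d)      # optimal model under D*
theta_s = theta_star + 0.2 * rng.normal(size=d)   # stand-in for a survey-learned model
sigma_eta = 0.3                                   # noise standard deviation

X = rng.normal(size=(n, d))                       # x ~ D*
y = X @ theta_star + rng.normal(0.0, sigma_eta, size=n)

# Left-hand side: loss of the survey model evaluated on the true distribution.
expected_loss = float(np.mean((X @ theta_s - y) ** 2))
# Right-hand side: approximation error plus intrinsic noise.
squared_dist = float(np.mean((X @ (theta_s - theta_star)) ** 2))
print(expected_loss, squared_dist + sigma_eta ** 2)
```

The two printed numbers agree up to Monte Carlo error, because the cross term between the noise and the model difference averages out to zero.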

4 SurVerify: Testing Credibility with Regression and Fixed Confidence

We first describe the algorithm design and then establish its efficiency in terms of sample complexity. To prove this result, we propose a two-sided generalization bound for regression, as well as a lower bound for methods that reconstruct the complete model in order to test dataset distances.

4.1 Dimension Agnostic Algorithm Design with Early Stopping

We now present our algorithmic framework, SurVerify, which verifies whether a regression model learned from a survey sample $S$ is close to the true optimal model in $\ell_{2}$ -distance (Definition 1). The algorithm performs this testing using a small number of samples drawn from the true distribution $\mathcal{D}^{*}$ . We refer to them as samples-to-test.

Algorithm SurVerify( $S\subset\mathbb{R}^{(d+1)},\mathcal{D}^{*},\epsilon,\delta,\mathcal{F}$ )

1:Require: $\left|{S}\right|\geq\mathcal{M}(\mathcal{F})$

2:Initialize $m\leftarrow|S|$ , $S_{\mathcal{D}^{*}}\leftarrow\emptyset,\tau\leftarrow\lceil\frac{2}{(1.9\epsilon)^{2}}\log\left({\frac{3}{\delta}}\right)\rceil$

3: $f_{S}\leftarrow\mathop{\mathrm{argmin}}_{f\in\mathcal{F}}\frac{1}{m}\sum_{(\mathbf{x},y)\in S}(f(\mathbf{x})-y)^{2}$

4: $\hat{L}_{S}\leftarrow\frac{1}{m}\sum_{(\mathbf{x},y)\in S}(f_{S}(\mathbf{x})-y)^{2}$

5: $\hat{\gamma}\leftarrow 0,t\leftarrow 0$

6:while $t<\tau$ do

7: $S_{\mathcal{D}^{*}}\leftarrow S_{\mathcal{D}^{*}}\cup\{(\mathbf{x}_{i},y_{i})\}$ , where $(\mathbf{x}_{i},y_{i})\sim\mathcal{D}^{*}$

8: $\hat{\gamma}\leftarrow\hat{\gamma}+(f_{S}(\mathbf{x}_{i})-y_{i})^{2}$

9: if $\hat{\gamma}-t\hat{L}_{S}>1.1t\epsilon+\sqrt{2t\log\left({\frac{3\tau}{\delta}}\right)}$ then

10: return REJECT

11: $t\leftarrow t+1$

12:if $\hat{\gamma}-\tau\hat{L}_{S}\leq 3\tau\epsilon$ then

13: return ACCEPT

14:else

15: return REJECT

We begin with an overview of our algorithmic framework, SurVerify, before presenting its formal correctness guarantee. The core idea behind SurVerify is to assess the credibility of a survey sample $S$ through a two-phase procedure. In the first phase (Lines 3 and 4), the algorithm fits a regression model $f_{S}$ using the survey data. In the second phase (Lines 6 to 15), it evaluates the reliability of $f_{S}$ by estimating its expected loss under the true distribution $\mathcal{D}^{*}$ , using a small number of i.i.d. samples-to-test. Specifically, it computes an additive estimate $\hat{\gamma}$ of the expected loss of $f_{S}$ on data from $\mathcal{D}^{*}$ . The algorithm then compares $\hat{\gamma}$ against a fixed threshold: if the estimated loss is low enough (Line 12 and onward), it outputs ACCEPT; otherwise, it outputs REJECT.

To be more sample-efficient, SurVerify also incorporates an early rejection criterion (Line 9) that terminates the evaluation of $f_{S}$ quickly when it incurs a large loss on the samples-to-test, i.e., when the loss deviates enough to be detected with only a few samples. Notably, the total number of samples-to-test required from $\mathcal{D}^{*}$ , denoted $\tau$ , is $O\left(\frac{1}{\epsilon^{2}}\log\left(\frac{1}{\delta}\right)\right)$ and is independent of the data dimension. This sample efficiency makes SurVerify well-suited for high-dimensional settings where direct access to the true distribution is limited, and for settings where collecting samples is costly (e.g., medical data).
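The two phases described above can be sketched in Python as follows. This is an illustrative sketch rather than the authors' implementation: it uses unconstrained least squares via NumPy in place of a general model class, reads $\log$ as the natural logarithm, and increments the counter $t$ before the early-rejection check so that $t$ equals the number of samples-to-test drawn so far (an assumption about the intended indexing). The names `surverify`, `sample_true`, and `make_sampler` are hypothetical.

```python
import math
import numpy as np

def surverify(X_s, y_s, sample_true, eps, delta):
    """Sketch of SurVerify for least-squares linear regression.

    X_s, y_s    : survey sample S
    sample_true : zero-argument callable returning one (x, y) drawn from D*
    Returns (decision, number of samples-to-test used).
    """
    # Phase 1: fit f_S on the survey and record its empirical loss.
    theta, *_ = np.linalg.lstsq(X_s, y_s, rcond=None)
    loss_s = float(np.mean((X_s @ theta - y_s) ** 2))

    # Worst-case number of samples-to-test (natural log is an assumption).
    tau = math.ceil(2.0 / (1.9 * eps) ** 2 * math.log(3.0 / delta))

    # Phase 2: accumulate the loss of f_S on fresh samples from D*.
    gamma = 0.0
    for t in range(1, tau + 1):  # t = samples drawn so far (assumed indexing)
        x, y = sample_true()
        gamma += float(x @ theta - y) ** 2
        # Early rejection: the accumulated excess loss is already too large.
        slack = 1.1 * t * eps + math.sqrt(2.0 * t * math.log(3.0 * tau / delta))
        if gamma - t * loss_s > slack:
            return "REJECT", t

    # Final check once all tau samples have been used.
    if gamma - tau * loss_s <= 3.0 * tau * eps:
        return "ACCEPT", tau
    return "REJECT", tau

rng = np.random.default_rng(0)
d = 20
theta_s = rng.normal(0.0, 0.1, size=d)
X_s = rng.normal(size=(5000, d))
y_s = X_s @ theta_s + rng.normal(0.0, 0.1, size=5000)

def make_sampler(theta_true):
    """Return a sampler for a hypothetical D* with coefficients theta_true."""
    def draw():
        x = rng.normal(size=d)
        return x, float(x @ theta_true + rng.normal(0.0, 0.1))
    return draw

close = surverify(X_s, y_s, make_sampler(theta_s), eps=0.1, delta=0.05)
far = surverify(X_s, y_s, make_sampler(theta_s + 1.0), eps=0.1, delta=0.05)
print(close, far)
```

In this toy run the survey whose model matches $\mathcal{D}^{*}$ uses the full budget $\tau$ before accepting, whereas the badly shifted one is rejected after only a handful of samples, which is the behaviour the early-rejection rule is designed to produce.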

4.2 Theoretical Analysis: Correctness, Sample Complexity, and Sufficient Size of Survey

The following theorem is the main structural result of this work. It shows that the validity of a model learned from survey data can be efficiently certified using only a small number of i.i.d. samples-to-test from the true distribution. This is especially useful when survey data is abundant but access to the true distribution is limited (e.g. medical data, socioeconomic data). By leveraging the framework of functional distance of distributions (defined in Section 3), SurVerify reliably distinguishes between two datasets with high confidence and low sample complexity.

Theorem (Correctness of SurVerify and Sample Complexity).

Given a survey sample $S$ (drawn from an unknown distribution $\mathcal{D}_{S}$ ), a model class $\mathcal{F}$ , and i.i.d. sampling access to the true distribution $\mathcal{D}^{*}$ , for any $\epsilon$ and $\delta\in(0,1)$ , if the size of $S$ is large enough (Table 1), then:

1.

If $\left({\texttt{FDD}_{\mathcal{D}^{*}}^{\mathcal{F}}(\mathcal{D}_{S},\mathcal{D}^{*})}\right)^{2}\leq\epsilon$ , then SurVerify outputs ACCEPT with probability at least $1-\delta$ .

2.

If $\left({\texttt{FDD}_{\mathcal{D}^{*}}^{\mathcal{F}}(\mathcal{D}_{S},\mathcal{D}^{*})}\right)^{2}>5\epsilon$ , then SurVerify outputs REJECT with probability at least $1-\delta$ .

Also, SurVerify requires at most $\lceil\frac{2}{(1.9\epsilon)^{2}}\log\left({\frac{3}{\delta}}\right)\rceil$ samples from $\mathcal{D}^{*}$ for validation.

Discussions: 1. Dimension Agnostic Tester: One of the interesting aspects of the above theorem is that the sample complexity of SurVerify is independent of the dimension. This efficiency stems from the fact that the algorithm focuses on verifying the credibility of the survey data rather than reconstructing the underlying regression model.
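As a quick illustration of this dimension independence, the worst-case budget from the theorem can be evaluated directly. The helper below is a hypothetical convenience function and reads $\log$ as the natural logarithm, which is an assumption; note that the data dimension $d$ does not appear anywhere in it.

```python
import math

def samples_to_test(eps, delta):
    """Worst-case number of samples-to-test: ceil(2 / (1.9 eps)^2 * log(3 / delta))."""
    return math.ceil(2.0 / (1.9 * eps) ** 2 * math.log(3.0 / delta))

# The budget depends only on eps and delta, never on the dimension d.
budgets = {eps: samples_to_test(eps, delta=0.05) for eps in (0.2, 0.1, 0.05)}
print(budgets)
```

Halving $\epsilon$ roughly quadruples the budget, while tightening $\delta$ by a factor of ten only adds a logarithmic term.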

2. Relaxing Purity of Samples: At first glance, Theorem 4.2 may seem limited in practical applicability, as it assumes access to the true distribution $\mathcal{D}^{*}$ . In real-world settings, we typically have access only to a distribution $\mathcal{D}^{\prime}$ that is close to $\mathcal{D}^{*}$ , for instance in total variation distance. Fortunately, since the sample complexity of SurVerify is $O(1)$ for fixed $\epsilon$ and $\delta$ , the algorithm remains effective in this approximate setting. By appropriately adjusting the tolerance and confidence parameters to account for the discrepancy between $\mathcal{D}^{*}$ and $\mathcal{D}^{\prime}$ , we can still guarantee the correctness of the testing procedure. This robustness follows directly from the Data Processing Inequality PW (25).

3. Dealing with Regression Models on a Subset of Dimensions: Oftentimes, broad survey data is used for various downstream tasks involving projections onto a small number of dimensions. However, the FDD metric is not robust to arbitrary projections: closeness between entire datasets does not necessarily imply closeness under such projections. In these cases, the only reliable approach is to run SurVerify on the projected dimensions. Fortunately, the same sample from $\mathcal{D}^{*}$ can be reused across multiple projection-based checks.

4. Fixing $\epsilon$ and $\delta$ in practice: In practice, $\delta$ is generally chosen within the range $[0.01,0.1]$ . Because the dependence of the sample complexity on $\delta$ is logarithmic, choosing a lower value does not impact the sample complexity much. The tolerance parameter $\epsilon$ should be chosen according to the confidence required w.r.t. the underlying noise $\eta$ . Since testing w.r.t. a lower $\epsilon$ does not cause an increase in sample complexity in practice, one strategy is to test with a lower value of $\epsilon$ and obtain a (constant factor) estimate of the FDD using SurVerify. If only a fixed number of samples-to-test is available, the strategy should be to fix the $\epsilon$ level theoretically attainable according to Theorem 4.2.

Requirement: Sufficient Size of the Survey Data.

We show the following two-sided generalization bound for a general hypothesis class using the empirical Rademacher complexity.

Theorem (Generic Two-sided Generalization Bound).

Given a hypothesis set $\mathcal{F}$ containing functions ${f}:{\mathcal{X}}\rightarrow{\mathcal{Y}}$ and a $\mu$ -Lipschitz loss function ${\mathcal{L}}:{\mathcal{Y}\times\mathcal{Y}}\rightarrow{\left[{0,M}\right]}$ (as per the standard nomenclature, a loss function $\mathcal{L}:\mathbb{R}\times\mathbb{R}\rightarrow\mathbb{R}$ is called $\mu$ -Lipschitz if for any fixed $y\in\mathbb{R}$ and $y_{1},y_{2}\in\mathbb{R}$ , we have $|\mathcal{L}(y_{1},y)-\mathcal{L}(y_{2},y)|\leq\mu|y_{1}-y_{2}|$ ), let $S$ be a sample set of size $m\geq 1$ drawn as i.i.d. samples from the distribution $\mathcal{D}$ . Then, with probability at least $1-\delta$ :

[TABLE]

Note that the upper-bound direction of this generalization bound can be found in the textbook MRT (18). We extend it to a two-sided bound controlling both under- and overestimation. This is particularly important since we aim to design a tolerant tester.

Note that computing the empirical Rademacher complexity $\widehat{\Re}_{S}\left({\mathcal{F}}\right)$ is known to be computationally hard for general hypothesis classes FH (23); MR (18). However, for bounded-weight linear and kernel-based classes, $\widehat{\Re}_{S}\left({\mathcal{F}}\right)$ admits tight analytical bounds (see AFM (20); MRT (18)). We use this fact together with Theorem 4.2 to bound the sample size needed for estimating the noise variance.

The following result gives the size of the survey data needed for estimating $f_{S}$ for Lasso, Ridge and Kernel hypothesis classes.

Lemma (Minimum Survey Size for Learning Noise Variance).

Let $S$ be a survey of size $m$ that is sufficiently large for the respective hypothesis class (see Table 1). If Assumptions 1 and 2 hold, then with probability at least $1-{\delta}/{3}$ we have

[TABLE]

where $f_{S}\triangleq\mathop{\mathrm{argmin}}_{f\in\mathcal{F}}\frac{1}{m}\sum_{(\mathbf{x},y)\in S}(f(\mathbf{x})-y)^{2}$ . Note that Table 1 gives the sufficient size of the survey data drawn from the distribution $\mathcal{D}_{S}$ for the Lasso, Ridge and Kernel hypothesis classes.

Remark 4 ( $s$ -Sparse Linear Regression).

For the hypothesis class $\mathcal{F}_{1}$ (Lasso), one might be interested in $s$ -sparse linear regression. In that case, we consider the coefficient vector $\boldsymbol{\theta}$ to be $s$ -sparse, and the hypothesis class is defined by $\{\mathbf{x}\rightarrow\left\langle{\boldsymbol{\theta}}\,,\,{\mathbf{x}}\right\rangle:\left\|{\boldsymbol{\theta}}\right\|_{1}\leq 1,\left\|{\boldsymbol{\theta}}\right\|_{0}\leq s\}$ . Suppose we are given survey data of size $m=\Omega\left({\frac{\log(s)}{\epsilon^{2}}}\right)$ from the distribution. If Assumptions 1 and 2 hold, then with probability at least $1-\delta$ , we have $\left|{\sigma_{\eta}^{2}-\frac{1}{m}\sum_{(\mathbf{x},y)\in S}(f_{S}(\mathbf{x})-y)^{2}}\right|\leq\frac{\epsilon}{10}$ .

Discussion: Relation to Out-Of-Distribution (OOD) Generalization. The OOD generalization literature assumes that an intrinsic model can be learned across distributions, i.e. the performance of the learned hypothesis generalizes well to OOD data (Assumptions A–D in (LSH+, 23)). Our mechanism, on the other hand, addresses the case where sampling from a different distribution results in a different model being learned. In other words, if there is an intrinsic model that can be learned across distributions, the distance $\texttt{FDD}_{\mathcal{D}^{*}}^{\mathcal{F}}(\mathcal{D}_{S},\mathcal{D}^{*})$ for the model class $\mathcal{F}$ would be $0$ for all distributions $\mathcal{D}_{S}$ and $\mathcal{D}^{*}$ . If that is not the case, we can efficiently detect whether the model learned from the survey distribution $\mathcal{D}_{S}$ generalizes well to the true distribution $\mathcal{D}^{*}$ .

4.3 Lower Bound on Sample Complexity: Advantage of Not Reconstructing the Model

SurVerify tests the model-specific credibility of a given sample survey without reconstructing the model itself. Not reconstructing the model is what allows the sample complexity to be independent of the dimension. The following lemma proves that any algorithm that reconstructs the model in order to estimate the model-specific distance needs a number of samples that grows linearly with the dimension.

Lemma (Lower Bound on Testing with Model Reconstruction).

Under Assumption 2, and when $\lambda_{\min}\left({\mathrm{Cov}\left({\mathbf{x}}\right)}\right)\geq\lambda_{\min}$ , any algorithm that reconstructs the model to estimate the distance $\texttt{FDD}_{\mathcal{D}^{*}}^{\mathcal{F}}(\mathcal{D}_{S},\mathcal{D}^{*})$ within $\epsilon$ additive error must make $\Omega\left({{\frac{d\lambda_{\min}\sigma_{\eta}^{2}}{\epsilon^{2}}}}\right)$ queries.

Furthermore, if $\mathcal{D}_{S}$ and $\mathcal{D}^{*}$ are two distributions such that their respective loss distributions are subgaussian with the same variance but with means that differ by $\epsilon$ , then $\texttt{FDD}_{\mathcal{D}^{*}}(\mathcal{D}_{S},\mathcal{D}^{*})=\epsilon$ (by Lemma 3). Since distinguishing between two such subgaussian distributions requires $\Omega(1/\epsilon^{2})$ samples, we observe that the sample complexity of SurVerify is tight in terms of its dependence on $\epsilon$ .

5 Experimental Analysis

In this section, we empirically verify whether our tester SurVerify performs as per the theoretical analysis. In particular, we are interested in the following research questions:

RQ1. Does SurVerify accept when the survey data $S$ is close to being a credible dataset with respect to the model class, and likewise, does SurVerify indeed reject when $S$ is far from being credible? Specifically, how does the acceptance rate of SurVerify change as the distance between the survey set $S$ and the true distribution $\mathcal{D}^{*}$ , and the tolerance parameter, change?

RQ2. How many i.i.d. samples-to-test from the true distribution $\mathcal{D}^{*}$ does SurVerify require to certify whether the survey data $S$ is credible? While the theoretical guarantee is for the worst-case runtime of SurVerify, we would like to check whether SurVerify can reject a far-from-credible survey $S$ with far fewer samples-to-test.

Experimental Setup. We implement all the algorithms in Python 3.10 and use LinearRegression from scikit-learn to learn $f_{S}$ . We run our simulations on Google Colaboratory with 2 Intel(R) Xeon(R) CPU @ 2.20GHz, 12.7GB RAM, and 107.7GB Disk Space.

Setup 1: Synthetic. We generate a synthetic dataset, where each coordinate of each $\mathbf{x}$ is generated from $\mathcal{N}(0,1)$ , and $\eta$ is generated from $\mathcal{N}(0,0.1)$ . For $\mathcal{D}_{S}$ , we generate $\boldsymbol{\theta}_{S}\in\mathbb{R}^{50}$ such that each coordinate is from $\mathcal{N}(0,0.01)$ . The size of our set $S$ thus obtained is 100,000. For $\mathcal{D}^{*}$ , we generate the coefficients $\boldsymbol{\theta}^{*}$ with each coordinate being generated from $\mathcal{N}(\mu,0.01)$ with $\mu$ taking values from [math] to $3$ at intervals of $0.1$ . As the value of $\mu$ increases the model distance between $f_{S}$ and $f^{*}$ increases.
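A data generator along the lines of Setup 1 might look as follows. This is a hedged sketch, not the authors' script: the function name, the pool size for $\mathcal{D}^{*}$ , and the seed are hypothetical, and the second argument of $\mathcal{N}(\cdot,\cdot)$ is read as a variance, which is an assumption.

```python
import numpy as np

def make_synthetic(mu, d=50, n_survey=100_000, n_true=5_000, seed=0):
    """Survey sample from D_S and a pool of samples from D* for a given shift mu."""
    rng = np.random.default_rng(seed)
    sd_coef = np.sqrt(0.01)   # N(., 0.01) read as variance 0.01 (assumption)
    sd_noise = np.sqrt(0.1)   # N(0, 0.1) read as variance 0.1 (assumption)

    theta_s = rng.normal(0.0, sd_coef, size=d)     # survey coefficients
    theta_star = rng.normal(mu, sd_coef, size=d)   # true coefficients, shifted by mu

    X_s = rng.normal(size=(n_survey, d))
    y_s = X_s @ theta_s + rng.normal(0.0, sd_noise, size=n_survey)
    X_t = rng.normal(size=(n_true, d))
    y_t = X_t @ theta_star + rng.normal(0.0, sd_noise, size=n_true)
    return (X_s, y_s), (X_t, y_t), (theta_s, theta_star)

# A smaller survey than in the paper keeps this example quick to run.
(X_s, y_s), (X_t, y_t), (theta_s, theta_star) = make_synthetic(mu=0.5, n_survey=10_000)

# With standard normal covariates, the unnormalized l2 model distance
# reduces to the Euclidean distance between the coefficient vectors.
model_distance = float(np.linalg.norm(theta_s - theta_star))
print(X_s.shape, X_t.shape, model_distance)
```

The distance printed here is unnormalized, so it is not directly comparable to the FDD values reported in the results below.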

Setup 2: ACS_Income. As a real-world dataset, we consider the normalized ACS_Income dataset, which exhibits well-known fairness issues between Gender and Racial groups (DHMS, 21). We chose $S$ to be generated through sampling from the subpopulation with the parameter Sex set to $2$ (Female), and the distribution $\mathcal{D}^{*}$ to be the subpopulation with Sex set to $1$ (Male). An important observation regarding this dataset is that it does not satisfy the homoskedasticity assumption (Assumption 1). In particular, over 50 trials, the correlation coefficient between the response variable $y$ and the residuals $y-f(\mathbf{x})$ w.r.t. the model $f$ obtained is $0.88$ .

Results and Observations. The findings from the experimental results on both the synthetic and the real-world data corroborate our theoretical results. The details are as follows:

Findings related to RQ1: We run SurVerify on each of the synthetic datasets and the ACS_Income dataset 50 times and record the average performance and $95$ percentile around it.

Acceptance Rate on Synthetic. In Figure 4 and 4, the BLUE curve indicates the acceptance rate of SurVerify on the synthetic datasets described above w.r.t. $\mathcal{F}_{2}$ (Ridge) and $\mathcal{F}_{1}$ (Lasso), respectively. For both model classes $\mathcal{F}_{1}$ and $\mathcal{F}_{2}$ , SurVerify exhibits similar behavior. It starts by accepting all models when the difference between the coefficients, and correspondingly the model distance, is small. As the difference between the coefficients, and correspondingly the model distance, increases, SurVerify starts rejecting with increasing probability, and rejects all the models generated with $\mu\geq 0.9,\texttt{FDD}\geq 0.20$ (resp. $\mu\geq 1,\texttt{FDD}\geq 0.21$ ) for model class $\mathcal{F}_{2}$ (resp. $\mathcal{F}_{1}$ ). The red and blue dashed vertical lines indicate the values of $\epsilon$ and $5\epsilon$ , respectively. Hence, when the model distance lies to the right of the blue line, SurVerify is expected to reject, whereas values to the left of the red line are expected to be accepted, validating our theoretical results.

Acceptance Rate on ACS_Income: In Figure 4 and 4, the BLUE line indicates the acceptance rate of SurVerify on ACS_Income w.r.t. $\mathcal{F}_{2}$ and $\mathcal{F}_{1}$ , respectively. We run SurVerify with varying tolerance parameter $\epsilon$ . SurVerify always rejects for $\epsilon$ less than $0.01$ , and accepts for higher values. The red and blue dotted vertical lines indicate the values of $\texttt{FDD}\left({\mathcal{D}_{S},\mathcal{D}^{*}}\right)\approx 0.02$ and $\texttt{FDD}\left({\mathcal{D}_{S},\mathcal{D}^{*}}\right)/5$ , respectively. Hence, as expected, we observe that for values of $\epsilon$ to the right of the red line, SurVerify accepts more often, while for values to the left of the blue line, SurVerify rejects. This further indicates that the $\texttt{FDD}\left({\mathcal{D}_{S},\mathcal{D}^{*}}\right)$ between male and female subpopulations of ACS_Income is at least $0.02$ with probability $0.9$ .

Findings related to RQ2: Sample Complexity. In Figure 4, 4, 4 and 4, the GREEN curve shows the #samples-to-test from $\mathcal{D}^{*}$ that SurVerify needed. As expected, in Figure 4 and 4, i.e., while running on the synthetic dataset, as long as SurVerify accepts, the #samples-to-test matches the worst-case complexity. But as the distance increases and the acceptance rate of SurVerify decreases, the #samples-to-test needed to reject also decreases. For both model classes $\mathcal{F}_{2}$ and $\mathcal{F}_{1}$ , the algorithm starts rejecting significantly faster once it reaches the $5\epsilon$ threshold.

For ACS_Income, since the FDD distance between the two distributions is $0.02$ , SurVerify accepts ( BLUE line) as $\epsilon$ increases. In this regime, we observe the predicted $1/\epsilon^{2}$ decay in the sample complexity ( GREEN line). But as $\epsilon$ gets smaller, SurVerify tends to reject. In particular, when $\epsilon\leq\texttt{FDD}\left({\mathcal{D}_{S},\mathcal{D}^{*}}\right)/5=0.004$ , the FDD distance is too large relative to $\epsilon$ , so early stopping kicks in and the sample complexity hits a plateau.

In conclusion, we observe that the effective sample complexity of the test decreases as the distance of $\mathcal{D}_{S}$ from $\mathcal{D}^{*}$ increases, and that the effective sample complexity of SurVerify is much lower than the worst-case complexity. Extended experimental results are presented in the Appendix.

6 Discussions, Limitations, and Future Works

We consider the problem of testing the credibility of survey data when used to develop a regression model. We propose an algorithm, SurVerify, that certifies the data quality by evaluating and testing the FDD metric between the survey and the true distribution without explicitly reconstructing the models, an approach that, to the best of our knowledge, is novel in the testing literature. Notably, the #samples-to-test required by SurVerify is independent of the data dimension, thereby overcoming the curse of dimensionality in this context.

In this paper, though we provide a general framework for testing credibility, our theoretical analysis focuses exclusively on linear and kernel regression models with bounded response and homoskedastic, non-correlated noise, which may limit its applicability. In the future, it would be interesting to extend model-specific credibility testing to regressions with heteroskedastic and correlated noise. It would also be interesting to extend the testing framework of our algorithm beyond regression models with bounded response, i.e. to settings where closed-form Rademacher complexity based generalization bounds are not known. Furthermore, as indicated by the experiments, the proposed framework works for unbounded data coming from tail-bounded distributions. Thus, it will be interesting to extend the theoretical analysis to such settings.

Acknowledgement

This work has been supported by the Inria-ISI, Kolkata associate team “SeRAI”. We also acknowledge the French National Research Agency (ANR) in the framework of the PEPR AI project FOUNDRY (ANR-23-PEIA-0003), and the ANR JCJC for the REPUBLIC project (ANR-22-CE23-0003-01) for partially supporting this work.

Appendix


Appendix A FDD-variance Decomposition of Loss: Proof of Lemma 3


Proof.

Observe that

[TABLE]

∎

Appendix B Generic Two-sided Generalization Bounds: Proof of Theorem 4.2

Before proving the theorem, we state the following results that are relevant to our proof:

Lemma 5 (Talagrand’s Contraction Lemma [33]).

Given a real-valued $\mu$ -Lipschitz loss function $\mathcal{L}$ , a sample set $S$ and a hypothesis class $\mathcal{F}$ of real-valued functions, the following inequality holds:

[TABLE]

Lemma 6 (McDiarmid’s Inequality [36]).

Let $X_{1},X_{2},\ldots,X_{m}\in\mathcal{X}$ be i.i.d. random variables, and suppose there exists a constant $c$ such that ${f}:{\mathcal{X}^{m}}\rightarrow{\mathbb{R}}$ satisfies:

[TABLE]

Then, for any $\epsilon>0$ the following inequality holds:

[TABLE]

We also introduce the definition of Rademacher complexity, which depends only on the class of functions under consideration.

Definition 7 (Rademacher Complexity [36]).

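Let $S$ be a sample set of size $m\geq 1$ drawn as i.i.d. samples from the distribution $\mathcal{D}$ . Then, the Rademacher complexity of $\mathcal{G}$ is the expectation of the empirical Rademacher complexity over all samples of size $m$ drawn from $\mathcal{D}$ :

$$\Re_{m}\left({\mathcal{G}}\right)=\operatorname*{\mathbb{E}}_{S\sim\mathcal{D}^{m}}\left[{\widehat{\Re}_{S}\left({\mathcal{G}}\right)}\right].$$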

The next result is the intermediate lemma required, which quantifies how well the empirical mean estimates the true expectation over a bounded function class, in terms of its empirical Rademacher complexity (Definition 2).

Lemma 8 (Two-sided Rademacher Bound for Bounded Functions).

Given a family of functions $\mathcal{G}$ containing functions ${g}:{\mathcal{Z}}\rightarrow{\left[{0,M}\right]}$ . Let $S$ be a sample set of size $m\geq 1$ drawn as i.i.d. samples from the distribution $\mathcal{D}$ . Then with probability at least $1-\delta$ for all $g\in\mathcal{G}$ :

[TABLE]

Proof.

For a given sample set $S$ of size $m$ , let us denote by $\hat{\operatorname*{\mathbb{E}}}_{S}\left[{g}\right]$ the empirical loss $\frac{1}{m}\sum_{i\in[m]}g(z_{i})$ . Consequently, we define a function $\Phi$ of a sample set $S$ as:

[TABLE]

We first upper bound the expectation of this function $\Phi(S)$ over $S\sim\mathcal{D}^{m}$ .

[TABLE]

Now, we will apply McDiarmid’s inequality (Lemma 6) to this function. For that purpose, observe that each coordinate of the input essentially corresponds to one of the data points in the sample. We use this fact and the boundedness of $g$ to obtain the bounded-difference condition required by McDiarmid’s inequality. Let us consider two sample sets $S$ and $S^{\prime}$ that differ at exactly one sample point, say the $i$ -th location. Then, we have:

[TABLE]

Here, the first inequality follows from the fact that $\sup_{x}\left({f(x)-g(x)}\right)\geq\sup_{x}f(x)-\sup_{x}g(x)$ . Now, by McDiarmid’s inequality, we have,

[TABLE]

Combining equations (1) and (2), we have for all $g\in\mathcal{G}$ :

[TABLE]

Now, we bound the empirical Rademacher complexity in terms of the Rademacher complexity. We again consider two sample sets $S$ and $S^{\prime}$ that differ at exactly one point, say $z_{i}$ . Then, using the fact that $\sup_{x}\left({f(x)-g(x)}\right)\geq\sup_{x}f(x)-\sup_{x}g(x)$ , we get

[TABLE]

Now, by McDiarmid’s inequality (Lemma 6), we have:

[TABLE]

Now, we combine equations (3) and (4) through a union bound to obtain:

[TABLE]

Similarly, we can show:

[TABLE]

Combining through a union bound, we get the desired result. ∎

Proof of Theorem 4.2. Now, we are ready to give the proof of Theorem 4.2.

Proof.

From Lemma 8, we know the two-sided deviation of the empirical mean from the true expectation over a bounded function class $\mathcal{G}$ containing functions ${g}:{\mathcal{Z}}\rightarrow{\left[{0,M}\right]}$ ,

[TABLE]

Take $\mathcal{G}$ to be the set of loss functions ${\mathcal{L}}:{\mathcal{Y}\times\mathcal{Y}}\rightarrow{\left[{0,M}\right]}$ , then for any $f\in\mathcal{F}$ we can write inequality (6) as,

[TABLE]

From Talagrand’s Contraction Lemma (Lemma 5), we have

[TABLE]

Plugging back inequality (7) in (6) we get the following with probability at least $1-\delta$ :

[TABLE]

This completes the proof.

∎

Appendix C Minimum Survey Size for Learning Noise Variance: Proof of Lemma 4.2

This section is organized into three parts. In subsection C.1, we establish a two-sided generalization bound for the estimation of the noise variance in the regression model, in terms of the empirical Rademacher complexity. In subsection C.2, we extend this result to both bounded linear and kernel hypothesis classes using their corresponding Rademacher bounds. Finally, subsection C.3 presents the proof of Lemma 4.2, which formalizes the minimum survey size required for estimating the noise variance.

C.1 From Two-sided Generalization Bound to Estimating Noise Variance

The next result uses Theorem 4.2 applied to the squared loss setting, and combines it with the assumptions specific to linear regression over bounded domains to estimate the noise variance.

Lemma 9 (Concentration of Empirical Squared Loss around Noise Variance).

Consider a linear regression model $f^{*}:y=f^{*}(\mathbf{x})+\eta$ , where $\eta$ is the zero-mean additive noise term with variance $\operatorname*{\mathrm{Var}}_{(\mathbf{x},y)\sim\mathcal{D}^{*}}(\eta)=\sigma_{\eta}^{2}$ . Let $S$ be a sample set of size $m\geq 1$ drawn as i.i.d. samples from the distribution $\mathcal{D}_{S}$ . If Assumptions 1 and 2 hold, then the regression model $f_{S}\leftarrow\mathop{\mathrm{argmin}}_{f\in\mathcal{F}}\frac{1}{m}\sum_{(\mathbf{x},y)\in S}(f(\mathbf{x})-y)^{2}$ satisfies, with probability at least $1-\delta$ :

[TABLE]

Proof.

We split the proof into two steps.

Step 1: Generalization bound for squared loss. In this step, we show the two-sided generalization bound for the squared loss using Theorem 4.2. We consider the squared loss $\mathcal{L}(f(\mathbf{x}),y)=(f(\mathbf{x})-y)^{2}$ for all $(\mathbf{x},y)\in\mathcal{X}\times\mathcal{Y}$ and $f\in\mathcal{F}$ .

From Assumption 2, we bound the maximum value of the loss function:

[TABLE]

Similarly, for the Lipschitzness of the loss function, we have:

[TABLE]

Now, applying Theorem 4.2 to the squared loss and function class $\mathcal{F}$ , we get,

[TABLE]

Step 2: Concentration of noise variance. We now show that the empirical squared loss of the estimator $f_{S}$ concentrates around the true noise variance $\sigma_{\eta}^{2}$ , using the generalization bound from Step 1 and Assumption 1. We prove the upper and lower bounds separately.

Upper Bound: Let $f^{*}$ be the optimal linear regression model on the true distribution $\mathcal{D}^{*}$ , and let $f^{*}_{S}$ be the optimal linear regression model on the survey distribution $\mathcal{D}_{S}$ .

From Assumption 1, we have:

[TABLE]

Therefore,

[TABLE]

The inequality (12) comes from the optimality of $f^{*}_{S}$ on $\mathcal{D}_{S}$ .

Now, applying the upper-sided generalization bound in (10) with $f=f_{S}$ , we have

[TABLE]

Combining (12) and (13) we get,

[TABLE]

Lower Bound: We now show the lower bound, by applying lower-sided generalization bound in (10) with $f=f^{*}_{S}$ we get,

[TABLE]

Since $f_{S}$ is the optimal regression model w.r.t. the empirical loss, we have

[TABLE]

Using inequality (16) in (15), we get

[TABLE]

Now, combining (11) and (17) we get,

[TABLE]

Combining the upper bound (14) and lower bound (18) using the union bound we get, with probability at least $1-\delta$ :

[TABLE]

This completes the proof. ∎

C.2 From Generalization Bound to Noise variance for Linear and Kernel Classes

In this section, we instantiate the general two-sided generalization bound for the empirical squared loss from Lemma 9 for specific families of hypothesis classes. In particular, we consider:

•

Linear function classes with bounded $\ell_{1}$ and $\ell_{2}$ norms, corresponding to Lasso and Ridge regression respectively.

•

Kernel-based function classes with bounded RKHS norm, corresponding to kernel regression.

In each case, we use upper bounds on the Rademacher complexity for the corresponding class, and then apply Lemma 9 to obtain corresponding generalization guarantees.

Case: $\mathcal{F}_{1}(\texttt{Lasso})$ and $\mathcal{F}_{2}(\texttt{Ridge})$

[1] has proved the following upper bound of the empirical Rademacher complexity for bounded linear hypothesis classes.

Lemma 10 (Empirical Rademacher Complexity of Bounded Linear Hypothesis [1]).

Let $\mathcal{F}_{p}=\{\mathbf{x}\rightarrow\left\langle{\boldsymbol{\theta}}\,,\,{\mathbf{x}}\right\rangle:\left\|{\boldsymbol{\theta}}\right\|_{p}\leq 1\}$ be a family of linear functions defined over $\mathbb{R}^{d}$ with bounded weight in $\ell_{p}$ -norm where $p\in\{1,2\}$ . Let $S=\left({\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{m}}\right)$ be a sample of size $m$ . Then, the empirical Rademacher complexity of $\mathcal{F}_{p}$ is upper bounded by:

[TABLE]

where $\mathbf{X}$ is a $d\times m$ matrix with the $\mathbf{x}_{i}$ ’s as columns: $\mathbf{X}=\left[{\mathbf{x}_{1}\ldots\mathbf{x}_{m}}\right]$ .
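As a rough illustration of the quantity bounded in Lemma 10, the sketch below estimates the empirical Rademacher complexity of $\mathcal{F}_{1}$ and $\mathcal{F}_{2}$ by Monte Carlo. It relies on the standard dual-norm identity $\sup_{\|\boldsymbol{\theta}\|_{p}\leq 1}\langle\boldsymbol{\theta},\mathbf{v}\rangle=\|\mathbf{v}\|_{p^{*}}$ and compares the estimates against the classical bounds obtained from Massart's lemma ( $p=1$ ) and Jensen's inequality ( $p=2$ ). These classical bounds are an assumption made for this example and need not match the exact constants of Lemma 10; the function names and the synthetic data are hypothetical.

```python
import math
import random

def rademacher_linear(X, p, draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of
    {x -> <theta, x> : ||theta||_p <= 1} for p in {1, 2}.
    X is a list of m samples, each a list of d floats."""
    rng = random.Random(seed)
    m, d = len(X), len(X[0])
    total = 0.0
    for _ in range(draws):
        sigma = [rng.choice((-1.0, 1.0)) for _ in range(m)]
        # v = sum_i sigma_i * x_i (a vector in R^d)
        v = [sum(sigma[i] * X[i][j] for i in range(m)) for j in range(d)]
        if p == 1:
            sup = max(abs(c) for c in v)            # dual of l1 is l_inf
        else:
            sup = math.sqrt(sum(c * c for c in v))  # l2 is self-dual
        total += sup / m
    return total / draws

def classical_bounds(X):
    """Classical upper bounds (not necessarily the constants of Lemma 10):
    p=1: max_j ||feature j||_2 * sqrt(2 log(2d)) / m  (Massart's lemma)
    p=2: ||X||_F / m                                  (Jensen's inequality)"""
    m, d = len(X), len(X[0])
    feature_norms = [math.sqrt(sum(X[i][j] ** 2 for i in range(m)))
                     for j in range(d)]
    bound_l1 = max(feature_norms) * math.sqrt(2.0 * math.log(2 * d)) / m
    bound_l2 = math.sqrt(sum(n * n for n in feature_norms)) / m
    return bound_l1, bound_l2

# Synthetic sample with ||x||_inf <= 1, as in Assumption 2.
_rng = random.Random(1)
X_demo = [[_rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(40)]
est_l1 = rademacher_linear(X_demo, p=1)
est_l2 = rademacher_linear(X_demo, p=2)
bound_l1, bound_l2 = classical_bounds(X_demo)
print(est_l1, bound_l1, est_l2, bound_l2)
```

Both estimates should sit below their respective bounds, and the $\ell_{1}$ estimate never exceeds the $\ell_{2}$ estimate because $\|\mathbf{v}\|_{\infty}\leq\|\mathbf{v}\|_{2}$ for every sign draw.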

Lemma 11 (Two-sided Generalization Bound of $\ell_{1}$ and $\ell_{2}$ bounded linear hypothesis class).

Let $\mathcal{F}_{p}=\{\mathbf{x}\rightarrow\left\langle{\boldsymbol{\theta}}\,,\,{\mathbf{x}}\right\rangle:\left\|{\boldsymbol{\theta}}\right\|_{p}\leq 1\}$ . Consider a linear regression model $f^{*}:y=f^{*}(\mathbf{x})+\eta$ , where $\eta$ is the zero-mean additive noise term with variance $\operatorname*{\mathrm{Var}}_{(\mathbf{x},y)\sim\mathcal{D}^{*}}(\eta)=\sigma_{\eta}^{2}$ , and a sample $S$ of size $m$ drawn i.i.d. from a distribution $\mathcal{D}_{S}$ . If Assumptions 1 and 2 hold, then the regression model

[TABLE]

satisfies, with probability at least $1-\delta$ :

[TABLE]

Proof.

From Assumption 2, we have $\left\|{\mathbf{x}}\right\|_{\infty}\leq 1$ for all $\mathbf{x}\in\mathcal{X}$ . Therefore, each column $\mathbf{x}_{i}$ of the matrix $\mathbf{X}\in\mathbb{R}^{d\times m}$ satisfies $\left\|{\mathbf{x}_{i}}\right\|_{\infty}\leq 1$ . Hence, $\left\|{\mathbf{X}^{T}}\right\|_{2,\infty}\leq\sqrt{m}$ and $\left\|{\mathbf{X}^{T}}\right\|_{2,2}\leq\sqrt{dm}$ .
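The two norm bounds can be spelled out as follows. This is a sketch under the assumption that $\left\|{\cdot}\right\|_{2,q}$ denotes the group norm that first takes the $\ell_{2}$ -norm of each column of $\mathbf{X}^{T}$ (one column per feature $j$ , running over the $m$ samples) and then the $\ell_{q}$ -norm of the resulting $d$ values; $x_{i,j}$ denotes the $j$ -th coordinate of $\mathbf{x}_{i}$ .

```latex
\begin{align*}
\|\mathbf{X}^{T}\|_{2,\infty}
  &= \max_{j\in[d]} \Big(\sum_{i=1}^{m} x_{i,j}^{2}\Big)^{1/2}
   \leq \max_{j\in[d]} \Big(\sum_{i=1}^{m} \|\mathbf{x}_{i}\|_{\infty}^{2}\Big)^{1/2}
   \leq \sqrt{m}, \\
\|\mathbf{X}^{T}\|_{2,2}
  &= \Big(\sum_{j=1}^{d}\sum_{i=1}^{m} x_{i,j}^{2}\Big)^{1/2}
   \leq \sqrt{dm}.
\end{align*}
```

Both inequalities are tight when every entry of $\mathbf{X}$ has absolute value $1$ .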

For $p=1$ (Lasso): From Lemma 10, we get:

[TABLE]

For $p=2$ (Ridge): Again, from Lemma 10,

[TABLE]

Now, plugging the above bounds on $\widehat{\mathcal{R}}_{S}(\mathcal{F}_{p})$ into Lemma 9 yields:

[TABLE]

which gives the desired bounds:

[TABLE]

∎

Case: $\mathcal{F}_{\mathbf{K}}(\texttt{Kernel})$

We define the hypothesis class as:

[TABLE]

where $\mathbf{\Phi}:\mathcal{X}\to\mathbb{H}$ is the feature map associated with a positive definite symmetric (PDS) kernel $\mathbf{K}:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ .

We first recall the following Rademacher complexity bound for kernel regression from [36, Theorem 6.12]:

Lemma 12 (PDS Kernel Rademacher Complexity Bound [36]).

Let $\mathbf{K}$ be a PDS kernel with associated feature map $\mathbf{\Phi}$ satisfying $\mathbf{K}(\mathbf{x},\mathbf{x})=\left\|{\mathbf{\Phi}(\mathbf{x})}\right\|^{2}_{\mathbb{H}}\leq r^{2}$ for all $\mathbf{x}\in\mathcal{X}$ . Then, for any i.i.d. sample $S$ of size $m$ , the empirical Rademacher complexity of $\mathcal{F}_{\mathbf{K}}$ satisfies:

[TABLE]
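To make the kernel bound concrete, the sketch below estimates the empirical Rademacher complexity of a norm-bounded RKHS ball by Monte Carlo and compares it with the trace bound $\Lambda\sqrt{\operatorname{Tr}[\mathbf{K}]}/m\leq\sqrt{r^{2}\Lambda^{2}/m}$ , which is the form of [36, Theorem 6.12] as we recall it. The norm radius `lam` (written $\Lambda$ ), the Gaussian kernel, its bandwidth, and the synthetic data are assumptions made for this example; the exact definition of $\mathcal{F}_{\mathbf{K}}$ used above may differ.

```python
import math
import random

def gaussian_kernel(x, z, bandwidth=2.0):
    """Gaussian (RBF) kernel; K(x, x) = 1, so r = 1 in the lemma's notation."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def kernel_rademacher(X, kernel, lam=1.0, draws=1000, seed=0):
    """Returns (Monte Carlo estimate, trace bound) for the class
    {x -> <w, Phi(x)> : ||w||_H <= lam}.
    Uses sup_w (1/m) sum_i sigma_i <w, Phi(x_i)> = (lam/m) sqrt(sigma^T K sigma)."""
    m = len(X)
    K = [[kernel(X[i], X[j]) for j in range(m)] for i in range(m)]
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        s = [rng.choice((-1.0, 1.0)) for _ in range(m)]
        quad = sum(s[i] * K[i][j] * s[j] for i in range(m) for j in range(m))
        total += lam * math.sqrt(max(quad, 0.0)) / m
    trace_bound = lam * math.sqrt(sum(K[i][i] for i in range(m))) / m
    return total / draws, trace_bound

_rng = random.Random(1)
X_demo = [[_rng.uniform(-1.0, 1.0) for _ in range(2)] for _ in range(30)]
kernel_est, kernel_bound = kernel_rademacher(X_demo, gaussian_kernel)
print(kernel_est, kernel_bound)
```

For the Gaussian kernel the trace bound equals $\Lambda/\sqrt{m}$ regardless of the input dimension, which is the dimension-free behaviour the kernel case relies on.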

Lemma 13 (Two-sided Generalization Error of Bounded Kernel Hypothesis).

Let $\mathcal{F}_{\mathbf{K}}$ be defined as above and suppose $\mathbf{K}(\mathbf{x},\mathbf{x})\leq r^{2}$ for all $\mathbf{x}\in\mathcal{X}$ . Consider a linear regression model $f^{*}:y=f^{*}(\mathbf{x})+\eta$ , where $\eta$ is the zero-mean additive noise term with variance $\operatorname*{\mathrm{Var}}_{(\mathbf{x},y)\sim\mathcal{D}^{*}}(\eta)=\sigma_{\eta}^{2}$ , and a sample $S$ of size $m$ drawn i.i.d. from a distribution $\mathcal{D}_{S}$ . If Assumptions 1 and 2 hold, then the regression model

[TABLE]

satisfies, with probability at least $1-\delta$ :

[TABLE]

Proof.

From Lemma 12, we have:

[TABLE]

Plugging this into the general bound from Lemma 9 yields the stated result. ∎

C.3 From Noise Variance Bounds to Minimum Survey Size: Proof of Lemma 4.2

We now translate the generalization error bounds derived for $\texttt{Lasso},\texttt{Ridge}$ and Kernel into sample size guarantees for estimating the noise variance, which leads to the proof of Lemma 4.2.

[Restated Lemma 4.2]

Proof.

From Lemmas 11 and 13, we have that with probability at least $1-\delta$ ,

[TABLE]

We now choose $m$ large enough so that each term on the right-hand side of (19) is at most $\epsilon/20$ , ensuring the total bound is at most $\epsilon/10$ .

For $p=1$ (Lasso): Set

[TABLE]

Then,

[TABLE]

Summing the two terms in (19) gives a bound of at most $\epsilon/10$ .

For $p=2\;(\texttt{Ridge})$ : Similarly, taking

[TABLE]

yields the desired bound.

For Kernel: Taking

[TABLE]

ensures that each term on the right-hand side of the kernel bound in (19) is at most $\epsilon/20$ , completing the proof. ∎

Appendix D Correctness of SurVerify and Sample Complexity: Proof of Theorem 4.2

We now prove the correctness of our algorithm. The theorem is restated below.

[Restated Theorem 4.2]

Proof.

The sample complexity can be read off directly from the algorithm, so the main task is to prove its correctness. We prove the two parts separately. To start, we observe that from Lemma 4.2, we have

[TABLE]

Let $\hat{\gamma}_{t}$ be the value of $\hat{\gamma}$ after $t$ rounds of the while loop. By linearity of expectation, we have

[TABLE]

From Assumption 2, we have $(f_{S}(x_{i})-y)^{2}\in[0,4]$ for all $i\in[t]$ . Since each of the $t$ independent variables is bounded, we now use Hoeffding’s inequality to bound the deviation of $\hat{\gamma}_{t}$ from its expectation.
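For intuition on how many rounds such a Hoeffding argument needs, the sketch below evaluates the two-sided Hoeffding tail for the average of $t$ independent variables in $[0,4]$ , namely $2\exp(-2ts^{2}/16)$ , and inverts it to get the smallest $t$ for a target deviation $s$ and failure probability $\delta$ . The values of $s$ and $\delta$ are illustrative assumptions; the thresholds and the way $\delta$ is split across rounds in SurVerify are not reproduced here.

```python
import math

def hoeffding_tail(t, s, lo=0.0, hi=4.0):
    """Two-sided Hoeffding bound on P(|mean of t samples - expectation| >= s)
    for independent samples taking values in [lo, hi]."""
    return min(1.0, 2.0 * math.exp(-2.0 * t * s * s / ((hi - lo) ** 2)))

def hoeffding_samples(s, delta, lo=0.0, hi=4.0):
    """Smallest t for which the bound above is at most delta."""
    return math.ceil((hi - lo) ** 2 * math.log(2.0 / delta) / (2.0 * s * s))

# Illustrative values only: deviation s = 0.1 and delta = 0.01.
t_needed = hoeffding_samples(s=0.1, delta=0.01)
print(t_needed, hoeffding_tail(t_needed, 0.1))
```

The required number of samples depends only on the range of the squared loss, $s$ , and $\delta$ , and not on the dimension $d$ of $\mathbf{x}$ .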

Proof of 1. In this case we have to bound the probability that SurVerify outputs REJECT in any iteration of the while loop or in the if statement at the end (lines 12 to 15).

By Hoeffding’s inequality, at any round $t\in[\tau]$ , we have

[TABLE]

If $\texttt{dist}_{\mathcal{D}^{*}}^{2}(f_{S},f^{*})\leq\epsilon$ , then from Lemma 3

[TABLE]
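The role of Lemma 3 in this step can be illustrated numerically. The sketch below assumes the standard decomposition that holds when the noise is zero-mean and independent of $\mathbf{x}$ , namely $\mathbb{E}[(f_{S}(\mathbf{x})-y)^{2}]=\texttt{dist}_{\mathcal{D}^{*}}^{2}(f_{S},f^{*})+\sigma_{\eta}^{2}$ ; the exact statement of Lemma 3 is not reproduced in this section. The coefficient vectors, the uniform noise, and the function name are hypothetical choices that keep $|y|\leq 1$ and $|f_{S}(\mathbf{x})|\leq 1$ .

```python
import random

def simulate_loss_decomposition(theta_star, theta_s, noise_half_width,
                                n=20000, seed=0):
    """Draws x uniformly from [-1, 1]^d and eta uniformly from
    [-noise_half_width, noise_half_width], sets y = <theta_star, x> + eta,
    and returns (empirical squared loss of f_S, empirical squared distance
    between f_S and f*, exact noise variance)."""
    rng = random.Random(seed)
    d = len(theta_star)
    loss = 0.0
    dist2 = 0.0
    for _ in range(n):
        x = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        eta = rng.uniform(-noise_half_width, noise_half_width)
        f_star = sum(a * b for a, b in zip(theta_star, x))
        f_s = sum(a * b for a, b in zip(theta_s, x))
        y = f_star + eta
        loss += (f_s - y) ** 2
        dist2 += (f_s - f_star) ** 2
    sigma2 = noise_half_width ** 2 / 3.0  # variance of a centered uniform
    return loss / n, dist2 / n, sigma2

# Hypothetical models: ||theta_star||_1 = 0.5 and ||theta_s||_1 = 0.65.
theta_star = [0.1, 0.1, 0.1, 0.1, 0.1]
theta_s = [0.2, 0.0, 0.15, 0.1, 0.2]
loss, dist2, sigma2 = simulate_loss_decomposition(theta_star, theta_s, 0.5)
print(loss, dist2 + sigma2)
```

The two printed numbers should nearly coincide, which is why an estimate of the loss of $f_{S}$ on $\mathcal{D}^{*}$ , together with an estimate of $\sigma_{\eta}^{2}$ , reveals the squared distance between $f_{S}$ and $f^{*}$ .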

Thus if $\texttt{dist}_{\mathcal{D}^{*}}^{2}(f_{S},f^{*})\leq\epsilon$ and $|\sigma_{\eta}^{2}-\hat{L}_{S}|\leq 0.1\epsilon$ then

[TABLE]

Combining (22) and (24) using the union bound, we get, at any round $t$ ,

[TABLE]

So, if $\texttt{dist}_{\mathcal{D}^{*}}^{2}(f_{S},f^{*})\leq\epsilon$ and $|\sigma_{\eta}^{2}-\hat{L}_{S}|\leq 0.1\epsilon$ , then the probability that SurVerify outputs REJECT in the while loop is at most $\delta/3$ . Also, let $\hat{\gamma}_{\tau}$ be the value of $\hat{\gamma}$ at the end of the while loop. The value of $\tau$ has been chosen so that

[TABLE]

Combining Equations (26) and (24), we see that if $\texttt{dist}_{\mathcal{D}^{*}}^{2}(f_{S},f^{*})\leq\epsilon$ and $|\sigma_{\eta}^{2}-\hat{L}_{S}|\leq 0.1\epsilon$ , then

[TABLE]

Finally, combining Equations (27), (25) and (20), we have that if $\texttt{dist}_{\mathcal{D}^{*}}^{2}(f_{S},f^{*})\leq\epsilon$ , then the probability that SurVerify outputs REJECT is at most $\delta$ .

Proof of 2. The proof of this part is simpler than the proof of 1. If $\texttt{dist}_{\mathcal{D}^{*}}^{2}(f_{S},f^{*})\geq 5\epsilon$ , we show that the probability that SurVerify outputs ACCEPT in the final if statement is less than $\delta$ . By Hoeffding’s inequality, we have Equation (26). Combining Equation (26) with Lemma 3 and Equation (20), we see that if $\texttt{dist}_{\mathcal{D}^{*}}^{2}(f_{S},f^{*})\geq 5\epsilon$ , then

[TABLE]

This completes the proof. ∎

Appendix E Lower Bound for Model Reconstruction

One can check directly whether the regression coefficient for the data in $S$ is close to the regression coefficient for $\mathcal{D}^{*}$ by generating an estimate $\widehat{\boldsymbol{\theta}}$ of the optimal regression coefficient $\boldsymbol{\theta}^{*}$ corresponding to $\mathcal{D}^{*}$ . However, the number of samples required for this approximate recovery problem grows with the dimension of the data. The following lemma, due to [20], quantifies this dependence:

Lemma 14 ( [20]).

For a regression model $y=\left\langle{\boldsymbol{\theta}^{*}}\,,\,{\mathbf{x}}\right\rangle+\eta$ with $\eta\sim\mathcal{N}\left({0,\sigma_{\eta}^{2}}\right)$ for $\mathbf{x},\boldsymbol{\theta}^{*}\in\mathbb{R}^{d}$ with $d\geq 2$ , any algorithm that produces an estimate $\widehat{\boldsymbol{\theta}}$ of $\boldsymbol{\theta}^{*}$ using $m$ samples must satisfy:

[TABLE]

In particular, if Assumption 2 holds, we have:

[TABLE]

We now provide the proof for Lemma 4.3, restated here.

[Restated Lemma 4.3]

Proof.

Let $\boldsymbol{\theta}_{e}\coloneqq\boldsymbol{\theta}^{*}-\widehat{\boldsymbol{\theta}}$ . Then for any $\mathbf{x}\in\mathbb{R}^{d}$ , the difference between the true and estimated predictions is:

[TABLE]

Therefore,

[TABLE]

Now, by setting the distance $\sqrt{\operatorname*{\mathbb{E}}_{\mathbf{x}\sim\mathcal{D}^{*}}\left[{\left\langle{\boldsymbol{\theta}_{e}}\,,\,{\mathbf{x}}\right\rangle^{2}}\right]}$ to be less than or equal to $\epsilon$ , we get

[TABLE]

∎

Hence, we use the loss to identify the model distance between these two quantities. The losses of the two regressions follow two Gaussians with different means and the same variance. Here, we state a lower bound on the difference of means in this setup.

Appendix F Experimental Results

In this section, we detail the outcomes of our experiments described in Section 5. In Tables 3 and 4, we list the outcomes of SurVerify on the synthetic dataset w.r.t. the $\mathcal{F}_{2}$ and $\mathcal{F}_{1}$ model classes, respectively. In Tables 5 and 6, we list the outcomes of SurVerify on the ACS_Income dataset w.r.t. the $\mathcal{F}_{2}$ and $\mathcal{F}_{1}$ model classes, respectively. As stated in Section 5, we have run 50 trials for every parameter choice, i.e., for each row in the tables. The value of $\delta$ is set to $0.01$ throughout. We also reproduce the figures here for ease of reading. The red and blue lines in each plot are as defined for the respective plot in Section 5.

Bibliography


[1] Pranjal Awasthi, Natalie Frank, and Mehryar Mohri. On the Rademacher complexity of linear hypothesis sets. CoRR, abs/2007.11045, 2020.
[2] David Alvarez-Melis and Nicolò Fusi. Geometric dataset distances via optimal transport. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20, Red Hook, NY, USA, 2020. Curran Associates Inc.
[3] Ivona Bezáková, Antonio Blanca, Zongchen Chen, Daniel Štefankovič, and Eric Vigoda. Lower bounds for testing graphical models: Colorings and antiferromagnetic Ising models. Journal of Machine Learning Research, 21(25):1–62, 2020.
[4] Antonio Blanca, Zongchen Chen, Daniel Štefankovič, and Eric Vigoda. Hardness of identity testing for restricted Boltzmann machines and Potts models. Journal of Machine Learning Research, 22(152):1–56, 2021.
[5] Abhijit Banerjee, Esther Duflo, Clement Imbert, Santhosh Mathew, and Rohini Pande. E-governance, accountability, and leakage in public programs: Experimental evidence from a financial management reform in India. American Economic Journal: Applied Economics, 12(4):39–72, 2020.
[6] Tugkan Batu, Lance Fortnow, Ronitt Rubinfeld, Warren D Smith, and Patrick White. Testing that distributions are close. In FOCS 2000, pages 259–269. IEEE, 2000.
[7] Arnab Bhattacharyya, Sutanu Gayen, Saravanan Kandasamy, and N. V. Vinodchandran. Testing product distributions: A closer look. In Vitaly Feldman, Katrina Ligett, and Sivan Sabato, editors, Proceedings of the 32nd International Conference on Algorithmic Learning Theory, volume 132 of Proceedings of Machine Learning Research, pages 367–396. PMLR, 16–19 Mar 2021.
[8] Arnab Bhattacharyya, Sutanu Gayen, Kuldeep S Meel, and N. V. Vinodchandran. Efficient distance approximation for structured high-dimensional distributions via learning. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 14699–14711. Curran Associates, Inc., 2020.
