Robust Inference via Multiplier Bootstrap

Xi Chen; Wen-Xin Zhou

arXiv:1903.07208·math.ST·March 19, 2019


TL;DR

This paper develops a robust inference method using multiplier bootstrap combined with adaptive Huber regression, effectively handling heavy-tailed data in confidence set construction and hypothesis testing, outperforming traditional least squares approaches.

Contribution

It introduces a novel robust inference framework that integrates adaptive Huber regression with multiplier bootstrap, addressing heavy-tailed data challenges.

Findings

1. The proposed method improves finite sample properties over least squares.
2. It provides reliable confidence sets and hypothesis tests under heavy-tailed noise.
3. Empirical results confirm the theoretical advantages of the approach.

Abstract

This paper investigates the theoretical underpinnings of two fundamental statistical inference problems, the construction of confidence sets and large-scale simultaneous hypothesis testing, in the presence of heavy-tailed data. With heavy-tailed observation noise, finite sample properties of the least squares-based methods, typified by the sample mean, are suboptimal both theoretically and empirically. In this paper, we demonstrate that the adaptive Huber regression, integrated with the multiplier bootstrap procedure, provides a useful robust alternative to the method of least squares. Our theoretical and empirical results reveal the effectiveness of the proposed method, and highlight the importance of having inference methods that are robust to heavy tailedness.


Tables

Table 1: Average coverage probabilities with $n=100$ and $d=5$ for different nominal coverage levels $1-\alpha=[0.95,0.9,0.85,0.8,0.75]$. The weights $W_{i}$ are generated from $\mathcal{N}(1,1)$.

Noise	Approach	$0.95$	$0.9$	$0.85$	$0.8$	$0.75$
Gaussian
	boot-Huber	0.954	0.908	0.842	0.783	0.734
	boot-OLS	0.952	0.908	0.837	0.785	0.735
$t_{ν}$
	boot-Huber	0.966	0.904	0.848	0.801	0.748
	boot-OLS	0.954	0.887	0.798	0.710	0.630
Gamma
	boot-Huber	0.962	0.918	0.860	0.798	0.747
	boot-OLS	0.955	0.910	0.843	0.775	0.700
Wbl mix
	boot-Huber	0.962	0.907	0.851	0.797	0.758
	boot-OLS	0.944	0.899	0.808	0.775	0.680
Par mix
	boot-Huber	0.955	0.907	0.856	0.802	0.761
	boot-OLS	0.948	0.900	0.843	0.785	0.738
Logn mix
	boot-Huber	0.958	0.912	0.860	0.782	0.744
	boot-OLS	0.954	0.912	0.796	0.682	0.616

Table 3: Average coverage probabilities with $n=200$ and $d=5$ for different nominal coverage levels $1-\alpha=[0.95,0.9,0.85,0.8,0.75]$. The weights $W_{i}$ are generated from $\mathcal{N}(1,1)$.

Noise	Approach	$0.95$	$0.9$	$0.85$	$0.8$	$0.75$
Gaussian
	boot-Huber	0.957	0.910	0.850	0.790	0.736
	boot-OLS	0.955	0.907	0.850	0.789	0.736
$t_{ν}$
	boot-Huber	0.958	0.906	0.848	0.798	0.749
	boot-OLS	0.940	0.863	0.772	0.684	0.599
Gamma
	boot-Huber	0.948	0.899	0.845	0.780	0.726
	boot-OLS	0.944	0.889	0.822	0.751	0.685
Wbl mix
	boot-Huber	0.954	0.889	0.837	0.775	0.713
	boot-OLS	0.939	0.865	0.784	0.695	0.621
Par mix
	boot-Huber	0.945	0.898	0.847	0.789	0.738
	boot-OLS	0.941	0.886	0.820	0.757	0.700
Logn mix
	boot-Huber	0.958	0.916	0.864	0.812	0.748
	boot-OLS	0.938	0.886	0.812	0.718	0.590

Table 4: Average coverage probabilities for the Wbl mix error and for different nominal coverage levels $1-\alpha=[0.95,0.9,0.85,0.8,0.75]$. The weights $W_{i}$ are generated from $\mathcal{N}(1,1)$.

Approach	$d$	$n$	$0.95$	$0.9$	$0.85$	$0.8$	$0.75$
boot-Huber	2	50	0.951	0.904	0.848	0.789	0.725
boot-Huber	2	100	0.959	0.914	0.866	0.827	0.771
boot-Huber	2	200	0.954	0.917	0.856	0.814	0.756
boot-Huber	5	50	0.982	0.945	0.876	0.826	0.752
boot-Huber	5	100	0.966	0.917	0.855	0.802	0.760
boot-Huber	5	200	0.950	0.894	0.835	0.777	0.721
boot-Huber	10	50	0.990	0.972	0.955	0.915	0.881
boot-Huber	10	100	0.980	0.949	0.897	0.850	0.799
boot-Huber	10	200	0.970	0.922	0.864	0.826	0.777
boot-OLS	2	50	0.942	0.887	0.827	0.758	0.672
boot-OLS	2	100	0.956	0.901	0.849	0.785	0.714
boot-OLS	2	200	0.947	0.898	0.822	0.763	0.685
boot-OLS	5	50	0.976	0.911	0.836	0.754	0.688
boot-OLS	5	100	0.954	0.896	0.824	0.751	0.674
boot-OLS	5	200	0.940	0.868	0.790	0.698	0.622
boot-OLS	10	50	0.997	0.970	0.919	0.844	0.761
boot-OLS	10	100	0.975	0.921	0.850	0.784	0.719
boot-OLS	10	200	0.954	0.879	0.816	0.731	0.650

Table 5: Average coverage probabilities for different nominal coverage levels $1-\alpha\in\{0.99,0.97,0.95,0.9,0.87\}$. The weights $W_{i}$ are generated from $\mathcal{N}(1,1)$.

Noise	Approach	0.99	0.97	0.95	0.90	0.87
$\mathcal{N}(0,1)$
	adaptive boot-Huber	0.993	0.970	0.942	0.896	0.868
	boot-Huber	0.991	0.971	0.946	0.899	0.868
	boot-OLS	0.993	0.970	0.948	0.895	0.868
$\mathrm{Logn}(0,1)$
	adaptive boot-Huber	0.994	0.978	0.961	0.919	0.880
	boot-Huber	0.997	0.983	0.969	0.935	0.895
	boot-OLS	0.997	0.978	0.955	0.848	0.750
$\mathrm{Logn}(0,1.5)$
	adaptive boot-Huber	0.994	0.980	0.961	0.916	0.882
	boot-Huber	1.000	0.992	0.978	0.948	0.921
	boot-OLS	0.999	0.989	0.972	0.864	0.710
$\mathrm{Logn}(0,2)$
	adaptive boot-Huber	0.995	0.979	0.961	0.904	0.881
	boot-Huber	1.000	0.996	0.989	0.958	0.939
	boot-OLS	1.000	0.996	0.980	0.879	0.692

Table 6: Standard deviations of the estimated quantiles for $(n,d)=(100,5)$ and nominal levels $1-\alpha=[0.95,0.9,0.85,0.8,0.75]$. The weights $W_{i}$ are generated from $\mathcal{N}(1,1)$.

Noise	Approach	$0.95$	$0.9$	$0.85$	$0.8$	$0.75$
Gaussian
	boot-Huber	0.210	0.301	0.365	0.412	0.443
	boot-OLS	0.214	0.299	0.365	0.416	0.446
$t_{ν}$
	boot-Huber	0.191	0.295	0.364	0.399	0.442
	boot-OLS	0.222	0.336	0.420	0.467	0.491
Gamma
	boot-Huber	0.191	0.292	0.366	0.410	0.441
	boot-OLS	0.220	0.309	0.384	0.423	0.456
Wbl mix
	boot-Huber	0.176	0.265	0.343	0.392	0.435
	boot-OLS	0.201	0.308	0.392	0.440	0.472
Par mix
	boot-Huber	0.181	0.286	0.358	0.408	0.443
	boot-OLS	0.196	0.324	0.414	0.466	0.488
Logn mix
	boot-Huber	0.176	0.268	0.331	0.392	0.425
	boot-OLS	0.189	0.327	0.416	0.461	0.488

Table 7: Standard deviations of the estimated quantiles for $(n,d)=(200,5)$ and nominal levels $1-\alpha=[0.95,0.9,0.85,0.8,0.75]$. The weights $W_{i}$ are generated from $\mathcal{N}(1,1)$.

Noise	Approach	$0.95$	$0.9$	$0.85$	$0.8$	$0.75$
Gaussian
	boot-Huber	0.232	0.333	0.390	0.427	0.456
	boot-OLS	0.236	0.340	0.388	0.431	0.455
$t_{ν}$
	boot-Huber	0.205	0.291	0.357	0.407	0.445
	boot-OLS	0.236	0.342	0.437	0.481	0.497
Gamma
	boot-Huber	0.212	0.295	0.358	0.401	0.440
	boot-OLS	0.232	0.307	0.366	0.415	0.451
Wbl mix
	boot-Huber	0.194	0.292	0.364	0.409	0.439
	boot-OLS	0.220	0.335	0.409	0.447	0.480
Par mix
	boot-Huber	0.168	0.252	0.333	0.395	0.433
	boot-OLS	0.189	0.314	0.415	0.469	0.495
Logn mix
	boot-Huber	0.232	0.314	0.379	0.418	0.447
	boot-OLS	0.245	0.363	0.452	0.491	0.500

Table 8: Average coverage probabilities for $(n,d)=(100,5)$ and nominal levels $1-\alpha=[0.95,0.9,0.85,0.8,0.75]$ when $\bm{X}_{i}$ are IID from a multivariate uniform distribution. Each component of $\bm{\theta}^{*}$ follows $\text{Ber}(0.5)$, and $W_{i}$ are generated from $\mathcal{N}(1,1)$.

Noise	Approach	$0.95$	$0.9$	$0.85$	$0.8$	$0.75$
Gaussian
	boot-Huber	0.946	0.898	0.848	0.781	0.725
	boot-OLS	0.944	0.898	0.843	0.782	0.720
$t_{ν}$
	boot-Huber	0.968	0.919	0.877	0.825	0.777
	boot-OLS	0.954	0.881	0.803	0.729	0.642
Gamma
	boot-Huber	0.961	0.911	0.868	0.812	0.751
	boot-OLS	0.958	0.900	0.842	0.778	0.716
Wbl mix
	boot-Huber	0.963	0.907	0.866	0.808	0.748
	boot-OLS	0.947	0.880	0.817	0.724	0.663
Par mix
	boot-Huber	0.974	0.928	0.882	0.842	0.775
	boot-OLS	0.963	0.897	0.815	0.715	0.634
Logn mix
	boot-Huber	0.972	0.936	0.888	0.834	0.780
	boot-OLS	0.962	0.901	0.804	0.701	0.615

Table 9: Average coverage probabilities for $(n,d)=(200,5)$ and nominal levels $1-\alpha=[0.95,0.9,0.85,0.8,0.75]$ when $\bm{X}_{i}$ are IID from a multivariate uniform distribution. Each component of $\bm{\theta}^{*}$ follows $\text{Ber}(0.5)$, and $W_{i}$ are generated from $\mathcal{N}(1,1)$.

Noise	Approach	$0.95$	$0.9$	$0.85$	$0.8$	$0.75$
Gaussian
	boot-Huber	0.953	0.893	0.844	0.799	0.743
	boot-OLS	0.955	0.893	0.846	0.798	0.744
$t_{ν}$
	boot-Huber	0.960	0.910	0.850	0.795	0.750
	boot-OLS	0.948	0.860	0.759	0.657	0.586
Gamma
	boot-Huber	0.949	0.896	0.836	0.782	0.729
	boot-OLS	0.947	0.886	0.817	0.765	0.704
Wbl mix
	boot-Huber	0.957	0.906	0.861	0.811	0.766
	boot-OLS	0.941	0.879	0.805	0.722	0.656
Par mix
	boot-Huber	0.963	0.924	0.862	0.798	0.743
	boot-OLS	0.958	0.869	0.751	0.669	0.581
Logn mix
	boot-Huber	0.958	0.909	0.849	0.795	0.739
	boot-OLS	0.947	0.861	0.723	0.600	0.531

Table 10: Average coverage probabilities for $(n,d)=(100,5)$ and nominal levels $1-\alpha=[0.95,0.9,0.85,0.8,0.75]$ when $\bm{X}_{i}$ are IID from a multivariate normal distribution with Toeplitz covariance structure and $\bm{\theta}^{*}$ is a vector of equally spaced points in $[0,1]$. The weights $W_{i}$ are generated from $\mathcal{N}(1,1)$.

Noise	Approach	$0.95$	$0.9$	$0.85$	$0.8$	$0.75$
Gaussian
	boot-Huber	0.954	0.899	0.842	0.783	0.732
	boot-OLS	0.952	0.901	0.842	0.778	0.726
$t_{ν}$
	boot-Huber	0.962	0.904	0.843	0.802	0.734
	boot-OLS	0.948	0.870	0.772	0.678	0.594
Gamma
	boot-Huber	0.962	0.906	0.841	0.786	0.736
	boot-OLS	0.949	0.893	0.820	0.767	0.706
Wbl mix
	boot-Huber	0.968	0.924	0.864	0.811	0.747
	boot-OLS	0.958	0.894	0.811	0.737	0.665
Par mix
	boot-Huber	0.966	0.910	0.849	0.790	0.733
	boot-OLS	0.960	0.881	0.780	0.681	0.609
Logn mix
	boot-Huber	0.968	0.922	0.875	0.811	0.763
	boot-OLS	0.963	0.878	0.778	0.693	0.608

Table 11: Average coverage probabilities for $(n,d)=(200,5)$ and nominal levels $1-\alpha=[0.95,0.9,0.85,0.8,0.75]$ when $\bm{X}_{i}$ are IID from a multivariate normal distribution with Toeplitz covariance structure and $\bm{\theta}^{*}$ is a vector of equally spaced points in $[0,1]$. The weights $W_{i}$ are generated from $\mathcal{N}(1,1)$.

Noise	Approach	$0.95$	$0.9$	$0.85$	$0.8$	$0.75$
Gaussian
	boot-Huber	0.943	0.873	0.813	0.761	0.706
	boot-OLS	0.941	0.867	0.815	0.754	0.708
$t_{ν}$
	boot-Huber	0.956	0.907	0.850	0.791	0.729
	boot-OLS	0.941	0.865	0.744	0.639	0.561
Gamma
	boot-Huber	0.953	0.904	0.849	0.799	0.738
	boot-OLS	0.943	0.895	0.841	0.779	0.717
Wbl mix
	boot-Huber	0.961	0.906	0.843	0.788	0.739
	boot-OLS	0.949	0.871	0.788	0.724	0.641
Par mix
	boot-Huber	0.971	0.932	0.873	0.807	0.751
	boot-OLS	0.963	0.889	0.779	0.674	0.572
Logn mix
	boot-Huber	0.943	0.889	0.826	0.775	0.725
	boot-OLS	0.936	0.844	0.715	0.597	0.516

Table 12: Empirical FDP and power for $(n,s)=(100,5)$. The nominal level $\alpha$ takes value in $\{0.05,0.10,0.15,0.20,0.25\}$. The signal strength $\mu_{k}=1.5\sqrt{2(\log m)/n}$ for $1\leq k\leq m_{1}$.

Noise	$α$	0.05	0.10	0.15	0.20	0.25
Gaussian
	FDP	0.027	0.044	0.075	0.106	0.138
	Power	0.935	0.962	0.978	0.986	0.989
$t_{ν}$
	FDP	0.017	0.030	0.053	0.080	0.105
	Power	0.928	0.953	0.969	0.978	0.983
Gamma
	FDP	0.048	0.076	0.119	0.159	0.197
	Power	0.957	0.981	0.993	0.996	0.999
Wbl mix
	FDP	0.038	0.060	0.098	0.130	0.160
	Power	0.951	0.977	0.988	0.993	1.000
Par mix
	FDP	0.033	0.056	0.094	0.129	0.165
	Power	0.998	0.999	1.000	1.000	1.000
Logn mix
	FDP	0.029	0.067	0.108	0.149	0.184
	Power	0.999	1.000	1.000	1.000	1.000

Table 13: Empirical FDP and power for the Wbl mix model. The nominal level $\alpha$ takes value in $\{0.05,0.10,0.15,0.20,0.25\}$. The signal strength $\mu_{k}=1.5\sqrt{2(\log m)/n}$ for $1\leq k\leq m_{1}$.

$s$	$n$	$α$	0.05	0.10	0.15	0.20	0.25
2	100	FDP	0.061	0.098	0.143	0.186	0.232
2	100	Power	0.981	0.991	0.995	0.997	0.998
2	200	FDP	0.048	0.101	0.149	0.195	0.242
2	200	Power	0.989	0.995	0.997	0.998	0.998
5	100	FDP	0.038	0.060	0.098	0.130	0.160
5	100	Power	0.951	0.977	0.988	0.993	1.000
5	200	FDP	0.056	0.092	0.137	0.180	0.223
5	200	Power	0.938	0.991	0.996	0.997	0.999
10	100	FDP	0.006	0.014	0.022	0.035	0.052
10	100	Power	0.607	0.825	0.897	0.940	0.961
10	200	FDP	0.043	0.070	0.110	0.145	0.187
10	200	Power	0.971	0.983	0.989	0.993	0.995

Table 14: Empirical FDP and power for $(n,s)=(100,5)$. The nominal level $\alpha$ takes value in $\{0.05,0.10,0.15,0.20,0.25\}$. The signal strength $\mu_{k}=3\sqrt{2(\log m)/n}$ for $1\leq k\leq m_{1}$.

Noise	$α$	0.05	0.10	0.15	0.20	0.25
Gaussian
	FDP	0.015	0.039	0.063	0.093	0.124
	Power	1.000	1.000	1.000	1.000	1.000
$t_{ν}$
	FDP	0.009	0.027	0.046	0.072	0.098
	Power	0.999	1.000	1.000	1.000	1.000
Gamma
	FDP	0.038	0.063	0.103	0.136	0.178
	Power	1.000	1.000	1.000	1.000	1.000
Wbl mix
	FDP	0.038	0.049	0.089	0.120	0.156
	Power	0.999	1.000	1.000	1.000	1.000
Par mix
	FDP	0.037	0.060	0.100	0.135	0.167
	Power	1.000	1.000	1.000	1.000	1.000
Logn mix
	FDP	0.024	0.067	0.099	0.128	0.157
	Power	1.000	1.000	1.000	1.000	1.000

Table 15: Empirical FDP and power for the Wbl mix model. The nominal level $\alpha$ takes value in $\{0.05,0.10,0.15,0.20,0.25\}$. The signal strength $\mu_{k}=3\sqrt{2(\log m)/n}$ for $1\leq k\leq m_{1}$.

$s$	$n$	$α$	0.05	0.10	0.15	0.20	0.25
2	100	FDP	0.042	0.087	0.125	0.179	0.221
2	100	Power	1.000	1.000	1.000	1.000	1.000
2	200	FDP	0.049	0.102	0.144	0.187	0.234
2	200	Power	1.000	1.000	1.000	1.000	1.000
5	100	FDP	0.026	0.049	0.089	0.120	0.156
5	100	Power	0.999	1.000	1.000	1.000	1.000
5	200	FDP	0.040	0.069	0.089	0.132	0.184
5	200	Power	1.000	1.000	1.000	1.000	1.000
10	100	FDP	0.011	0.014	0.022	0.041	0.054
10	100	Power	0.991	0.995	0.999	0.999	0.999
10	200	FDP	0.040	0.069	0.102	0.131	0.166
10	200	Power	1.000	1.000	1.000	1.000	1.000

Equations

$$Y_{i}=\bm{X}_{i}^{\intercal}\bm{\theta}^{*}+\varepsilon_{i},\quad i=1,\ldots,n.$$

$$\mathbb{P}\left(|\widehat{\mu}_{n}-\mu|\geq\sigma\sqrt{\frac{1}{\delta n}}\right)\leq\delta\quad\mbox{for any }\delta\in(0,1).$$

$$\mathbb{P}\left\{|\widehat{\mu}_{n}-\mu|\geq\sigma\sqrt{\frac{2\log(2/\delta)}{n}}\right\}\leq\delta.$$

$$\mathbb{P}\left\{|\widehat{\mu}_{n}-\mu|\geq\sigma\sqrt{\frac{1}{\delta n}}\bigg(1-\frac{e\delta}{n}\bigg)^{(n-1)/2}\right\}\geq\delta.$$

$$y_{ik}=\mu_{k}+\bm{x}_{i}^{\intercal}\bm{\beta}_{k}+\varepsilon_{ik},\quad i=1,\ldots,n,\ k=1,\ldots,m,$$

$$H_{0k}:\mu_{k}=0\quad\mbox{versus}\quad H_{1k}:\mu_{k}\neq 0,\quad\mbox{for }k=1,\ldots,m.$$

$$\ell_{\tau}(u)=\left\{\begin{array}{ll}u^{2}/2,&\mbox{if }|u|\leq\tau,\\ \tau|u|-\tau^{2}/2,&\mbox{if }|u|>\tau,\end{array}\right.$$

$$\widehat{\bm{\theta}}_{\tau}\in\arg\min_{\bm{\theta}\in\mathbb{R}^{d}}\mathcal{L}_{\tau}(\bm{\theta})\quad\mbox{with}\quad\mathcal{L}_{\tau}(\bm{\theta})=\mathcal{L}_{n,\tau}(\bm{\theta}):=\sum_{i=1}^{n}\ell_{\tau}(Y_{i}-\bm{X}_{i}^{\intercal}\bm{\theta}).$$

$$\bm{Z}_{i}=\bm{\Sigma}^{-1/2}\bm{X}_{i},\quad i=1,\ldots,n.$$

$$\mathbb{P}\Bigg\{\|\bm{\Sigma}^{1/2}(\widehat{\bm{\theta}}_{\tau}-\bm{\theta}^{*})\|_{2}\geq c_{1}v\sqrt{\frac{d+t}{n}}\Bigg\}\leq 2e^{-t}$$

$$\mbox{and}\quad\mathbb{P}\Bigg\{\bigg\|\bm{\Sigma}^{1/2}(\widehat{\bm{\theta}}_{\tau}-\bm{\theta}^{*})-\frac{1}{n}\sum_{i=1}^{n}\ell^{\prime}_{\tau}(\varepsilon_{i})\bm{Z}_{i}\bigg\|_{2}\geq c_{2}v\frac{d+t}{n}\Bigg\}\leq 3e^{-t}$$

$$\bigg|\mathcal{L}_{\tau}(\bm{\theta}^{*})-\mathcal{L}_{\tau}(\widehat{\bm{\theta}}_{\tau})-\frac{1}{2}\bigg\|\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\ell^{\prime}_{\tau}(\varepsilon_{i})\bm{Z}_{i}\bigg\|_{2}^{2}\bigg|\leq c_{4}v^{2}\frac{(d+t)^{3/2}}{\sqrt{n}}$$

$$\mbox{and}\quad\bigg|\sqrt{2\{\mathcal{L}_{\tau}(\bm{\theta}^{*})-\mathcal{L}_{\tau}(\widehat{\bm{\theta}}_{\tau})\}}-\bigg\|\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\ell^{\prime}_{\tau}(\varepsilon_{i})\bm{Z}_{i}\bigg\|_{2}\bigg|\leq c_{5}v\frac{d+t}{\sqrt{n}}$$

$$\tau=v\{n/(d+t)\}^{\eta}\quad\mbox{for any }\eta\in[1/(2+\delta),1/2]\mbox{ and }v\geq\upsilon_{2+\delta}^{1/(2+\delta)},$$

$$\widehat{\bm{\theta}}_{\tau}=\bm{\theta}^{*}+\frac{1}{n}\sum_{i=1}^{n}\ell^{\prime}_{\tau}(\varepsilon_{i})\bm{\Sigma}^{-1}\bm{X}_{i}+O\{n^{-1}(d+\log n)\}$$

$$\rho_{\tau}(u)=\left\{\begin{array}{ll}u^{2}/2-u^{4}/(24\tau^{2}),&\mbox{if }|u|\leq\sqrt{2}\,\tau,\\ \frac{2\sqrt{2}}{3}\tau|u|-\tau^{2}/2,&\mbox{if }|u|>\sqrt{2}\,\tau.\end{array}\right.$$
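The two loss functions listed above, the Huber loss $\ell_{\tau}$ and the smoothed variant $\rho_{\tau}$, can be checked numerically. The following is only an illustrative sketch: the function names `huber` and `rho` and the value of `tau` are assumptions made for this example. It evaluates both losses and checks that $\rho_{\tau}$ and its slope match across the knot $|u|=\sqrt{2}\,\tau$, where both branches equal $5\tau^{2}/6$.

```python
import numpy as np


def huber(u, tau):
    """Huber loss: u^2/2 for |u| <= tau, tau*|u| - tau^2/2 otherwise."""
    a = np.abs(u)
    return np.where(a <= tau, 0.5 * u**2, tau * a - 0.5 * tau**2)


def rho(u, tau):
    """Smoothed loss: quartic correction inside sqrt(2)*tau, linear outside."""
    a = np.abs(u)
    inner = 0.5 * u**2 - u**4 / (24.0 * tau**2)
    outer = (2.0 * np.sqrt(2.0) / 3.0) * tau * a - 0.5 * tau**2
    return np.where(a <= np.sqrt(2.0) * tau, inner, outer)


tau = 1.5                      # illustrative value, not taken from the paper
knot = np.sqrt(2.0) * tau      # point where the two branches of rho meet
eps = 1e-6

value_at_knot = float(rho(knot, tau))
# One-sided finite-difference slopes on either side of the knot.
slope_left = float(rho(knot, tau) - rho(knot - eps, tau)) / eps
slope_right = float(rho(knot + eps, tau) - rho(knot, tau)) / eps

print("rho at knot:", value_at_knot, "expected 5*tau^2/6 =", 5.0 * tau**2 / 6.0)
print("slopes left/right of knot:", slope_left, slope_right)
```

Both slopes approach $2\sqrt{2}\,\tau/3$, the slope of the linear branch, so $\rho_{\tau}$ is continuously differentiable at the knot, just as $\ell_{\tau}$ is at $|u|=\tau$.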

$$\mathcal{C}_{\alpha}^{*}(\sigma):=\{\bm{\theta}\in\mathbb{R}^{d}:\mathcal{L}_{\tau}(\bm{\theta})-\mathcal{L}_{\tau}(\widehat{\bm{\theta}}_{\tau})\leq\sigma^{2}\chi_{d,\alpha}^{2}/2\},$$

$$\bm{\theta}^{*}$$

$$\mathbb{E}(U_{i})=0,\quad\mathbb{E}(U_{i}^{2})=1,\quad i=1,\ldots,n.$$

$$\mathcal{L}_{\tau}^{\mathrm{b}}(\bm{\theta})=\sum_{i=1}^{n}W_{i}\ell_{\tau}(Y_{i}-\bm{X}_{i}^{\intercal}\bm{\theta}),\quad\bm{\theta}\in\mathbb{R}^{d}$$

$$\mbox{and}\quad\widehat{\bm{\theta}}_{\tau}^{\mathrm{b}}\in\arg\min_{\bm{\theta}\in\mathbb{R}^{d}:\|\bm{\theta}-\widehat{\bm{\theta}}_{\tau}\|_{2}\leq R}\mathcal{L}_{\tau}^{\mathrm{b}}(\bm{\theta}),$$

$$z_{\alpha}^{\mathrm{b}}=\inf\{z\geq 0:\mathbb{P}^{*}\{\mathcal{L}_{\tau}^{\mathrm{b}}(\widehat{\bm{\theta}}_{\tau})-\mathcal{L}_{\tau}^{\mathrm{b}}(\widehat{\bm{\theta}}_{\tau}^{\mathrm{b}})>z\}\leq\alpha\}.$$

$$z_{\alpha}:=\inf\{z\geq 0:\mathbb{P}\{\mathcal{L}_{\tau}(\bm{\theta}^{*})-\mathcal{L}_{\tau}(\widehat{\bm{\theta}}_{\tau})>z\}\leq\alpha\}.$$

$$\mathbb{P}^{*}\{\|\bm{\Sigma}^{1/2}(\widehat{\bm{\theta}}_{\tau}^{\mathrm{b}}-\bm{\theta}^{*})\|_{2}\geq c_{1}v(d+t)^{1/2}n^{-1/2}\}\leq 3e^{-t},$$

$$\mathbb{P}^{*}\Bigg\{\bigg\|\bm{\Sigma}^{1/2}(\widehat{\bm{\theta}}_{\tau}^{\mathrm{b}}-\widehat{\bm{\theta}}_{\tau})-\frac{1}{n}\sum_{i=1}^{n}\ell_{\tau}^{\prime}(\varepsilon_{i})U_{i}\bm{Z}_{i}\bigg\|_{2}\geq$$

$$\bm{\xi}^{\mathrm{b}}(\bm{\theta})=\bm{\Sigma}^{-1/2}\{\nabla\mathcal{L}_{\tau}^{\mathrm{b}}(\bm{\theta})-\nabla\mathbb{E}^{*}\mathcal{L}_{\tau}^{\mathrm{b}}(\bm{\theta})\},\quad\bm{\theta}\in\mathbb{R}^{d}.$$

$$\bm{\xi}^{\mathrm{b}}(\bm{\theta})=\bm{\Sigma}^{-1/2}\{\nabla\mathcal{L}_{\tau}^{\mathrm{b}}(\bm{\theta})-\nabla\mathcal{L}_{\tau}(\bm{\theta})\}=-\sum_{i=1}^{n}\ell_{\tau}^{\prime}(Y_{i}-\bm{X}_{i}^{\intercal}\bm{\theta})U_{i}\bm{Z}_{i},\quad\bm{\theta}\in\mathbb{R}^{d}.$$

$$\mathbb{P}^{*}\Bigg[\bigg|\mathcal{L}^{\mathrm{b}}_{\tau}(\widehat{\bm{\theta}}_{\tau})-\mathcal{L}^{\mathrm{b}}_{\tau}(\widehat{\bm{\theta}}_{\tau}^{\mathrm{b}})-\frac{\|\bm{\xi}^{\mathrm{b}}\|_{2}^{2}}{2n}\bigg|\geq c_{5}v^{2}$$

$$\mathbb{P}^{*}\Bigg[\bigg|\sqrt{2\{\mathcal{L}^{\mathrm{b}}_{\tau}(\widehat{\bm{\theta}}_{\tau})-\mathcal{L}^{\mathrm{b}}_{\tau}(\widehat{\bm{\theta}}_{\tau}^{\mathrm{b}})\}}-\frac{\|\bm{\xi}^{\mathrm{b}}\|_{2}}{\sqrt{n}}\bigg|\geq c_{6}v$$

$$\tau=v\{n/(d+t)\}^{\eta}\quad\mbox{for any }\eta\in[1/4,1/2)\mbox{ and }v\geq\upsilon_{4}^{1/4},$$

$$|\mathbb{P}\{\mathcal{L}_{\tau}(\bm{\theta}^{*})-\mathcal{L}_{\tau}(\widehat{\bm{\theta}}_{\tau})\leq z\}-\mathbb{P}^{*}\{\mathcal{L}_{\tau}^{\mathrm{b}}(\widehat{\bm{\theta}}_{\tau})-\mathcal{L}_{\tau}^{\mathrm{b}}(\widehat{\bm{\theta}}_{\tau}^{\mathrm{b}})\leq z\}|\leq\Delta_{1}(n,d,t),$$

$$\Delta_{1}(n,d,t)=C\{d^{3/2}n^{-1/2}+d^{1/2}\{(d+t)/n\}^{1-2\eta}+(d+t)^{3\eta}n^{1/2-3\eta}\}+7e^{-t}$$


Taxonomy

Topics: Statistical Methods and Inference · Advanced Statistical Methods and Models · Statistical Methods and Bayesian Inference

Full text


Xi Chen (Stern School of Business, New York University, New York, NY 10012, USA. E-mail: [email protected]) and Wen-Xin Zhou (Department of Mathematics, University of California, San Diego, La Jolla, CA 92093, USA. E-mail: [email protected]. Supported in part by NSF Grant DMS-1811376.)


Keywords: Confidence set; heavy-tailed data; multiple testing; multiplier bootstrap; robust regression; Wilks’ theorem.

1 Introduction

In classical statistical analysis, it is typically assumed that data are drawn from a Gaussian distribution. Although the normality assumption has been widely adopted to facilitate methodological development and theoretical analysis, Gaussian models can be an idealization of the complex random world. The non-Gaussian, or even heavy-tailed, character of the distribution of empirical data has been repeatedly observed in various domains, ranging from genomics and medical imaging to economics and finance. This brings new challenges to classical statistical methods. For linear models, regression estimators based on the least squares loss are suboptimal, both theoretically and empirically, in the presence of heavy-tailed errors. The necessity of robust alternatives to least squares was first noticed by Peter Huber in his seminal work “Robust Estimation of a Location Parameter” (Huber, 1964). Due to the growing complexity of modern data, the notion of robustness is becoming increasingly important in statistical analysis and finds its use in a wide range of applications. We refer to Huber and Ronchetti (2009) for an overview of robust statistics.

Although the past few decades have witnessed the active development of rich statistical theory on robust estimation, robust statistical inference for heavy-tailed data has always been a challenging problem on which the extant literature has been somewhat silent. Fan, Hall and Yao (2007), Delaigle, Hall and Jin (2011) and Liu and Shao (2014) investigated robust inference that is confined to the Student’s $t$-statistic. However, as pointed out by Devroye et al. (2016) (see Section 8 therein), sharp confidence estimation for heavy-tailed data in the finite sample set-up remains an open problem and a general methodology is still lacking. To that end, this paper takes a further step in studying confidence estimation from a robust perspective. In particular, under a linear model with heavy-tailed errors, we address two fundamental problems: (1) confidence set construction for regression coefficients, and (2) large-scale multiple testing with the guarantee of false discovery rate control. The developed techniques provide mathematical underpinnings for a class of robust statistical inference problems. Moreover, sharp exponential-type bounds for the coverage probability of the confidence set are derived under weak moment assumptions.

1.1 Confidence sets

Consider the linear model $Y=\bm{X}^{\intercal}\bm{\theta}^{*}+\varepsilon$ , where $Y\in\mathbb{R}$ denotes the response variable, $\bm{X}\in\mathbb{R}^{d}$ is the (random) vector of covariates, $\bm{\theta}^{*}\in\mathbb{R}^{d}$ is the vector of regression coefficients and $\varepsilon$ represents the regression error satisfying $\mathbb{E}(\varepsilon|\bm{X})=0$ and $\sigma^{2}=\mathbb{E}(\varepsilon^{2}|\bm{X})<\infty$ . Assume that we observe a random sample $(Y_{1},\bm{X}_{1}),\ldots,(Y_{n},\bm{X}_{n})$ from $(Y,\bm{X})$ :

$$Y_{i}=\bm{X}_{i}^{\intercal}\bm{\theta}^{*}+\varepsilon_{i},\quad i=1,\ldots,n.\tag{1.1}$$

The intercept term is implicitly assumed in model (1.1) by taking the first element of $\bm{X}_{i}$ to be one so that the first element of $\bm{\theta}^{*}$ becomes the intercept. The least squares estimator and its variations have been widely adopted to estimate $\bm{\theta}^{*}$ , which on many occasions achieve the minimax rate in terms of the mean squared error (MSE).

Although the MSE plays an important role in estimation, an estimator that is optimal in MSE might be suboptimal in terms of non-asymptotic deviation, which often relates to the construction of confidence intervals. For example, in the mean estimation problem, although the sample mean has an optimal minimax mean squared error among all mean estimators, its deviation is worse for non-Gaussian samples than for Gaussian ones, and the worst-case deviation is suboptimal when the sampling distribution has heavy tails (Catoni, 2012). More specifically, let $X_{1},\ldots,X_{n}$ be independent random variables from $X$ with mean $\mu$ and variance $\sigma^{2}>0$. For the empirical mean $\widehat{\mu}_{n}=(1/n)\sum_{i=1}^{n}X_{i}$, Chebyshev’s inequality delivers a polynomial-type deviation bound

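$$\mathbb{P}\left(|\widehat{\mu}_{n}-\mu|\geq\sigma\sqrt{\frac{1}{\delta n}}\right)\leq\delta\quad\mbox{for any }\delta\in(0,1).$$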

In addition, if the distribution of $X$ is sub-Gaussian, i.e. $\mathbb{E}\exp(\lambda X)\leq\exp(\sigma^{2}\lambda^{2}/2)$ for all $\lambda$ , then following the terminology in Devroye et al. (2016), $\widehat{\mu}_{n}$ becomes a sub-Gaussian estimator in the sense that

$$\mathbb{P}\left\{|\widehat{\mu}_{n}-\mu|\geq\sigma\sqrt{\frac{2\log(2/\delta)}{n}}\right\}\leq\delta.$$

Catoni (2012) established a lower bound for the deviations of $\widehat{\mu}_{n}$ when the sampling distribution is the least favorable in the class of all distributions with bounded variance: for any $\delta\in(0,e^{-1})$ , there is some distribution with mean $\mu$ and variance $\sigma^{2}$ such that an independent sample of size $n$ drawn from it satisfies

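$$\mathbb{P}\left\{|\widehat{\mu}_{n}-\mu|\geq\sigma\sqrt{\frac{1}{\delta n}}\bigg(1-\frac{e\delta}{n}\bigg)^{(n-1)/2}\right\}\geq\delta.$$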

This shows that the deviation bound obtained from Chebyshev’s inequality is essentially sharp under the finite variance condition. The limitation of the least squares method also arises in the regression setting, which has triggered an outpouring of interest in developing sub-Gaussian estimators for heavy-tailed data, from univariate mean estimation to multivariate or even high dimensional problems, that achieve sharp deviation bounds from a non-asymptotic viewpoint. See, for example, Brownlees, Joly and Lugosi (2015), Minsker (2015, 2018), Hsu and Sabato (2016), Devroye et al. (2016), Catoni and Giulini (2017), Giulini (2017), Fan, Li and Wang (2017), Sun et al. (2018) and Lugosi and Mendelson (2019), among others. In particular, Fan, Li and Wang (2017) and Sun et al. (2018) proposed adaptive (regularized) Huber estimators with diverging robustification parameters (see Definition 2.1 in Section 2.1), and derived exponential-type deviation bounds when the error distribution only has finite variance. Their key observation is that the robustification parameter should adapt to the sample size, dimensionality and noise level to achieve an optimal tradeoff between bias and robustness.
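To make the gap between the polynomial-type and exponential-type deviation bounds concrete, the following sketch evaluates the three deviation radii discussed above for a few confidence levels: the Chebyshev radius $\sigma\sqrt{1/(\delta n)}$, the sub-Gaussian radius $\sigma\sqrt{2\log(2/\delta)/n}$, and the radius appearing in Catoni's lower bound. The function names and the values of `sigma`, `n` and `delta` are illustrative assumptions, not part of the paper.

```python
import math


def chebyshev_radius(sigma, n, delta):
    """Deviation radius from Chebyshev's inequality (finite variance only)."""
    return sigma * math.sqrt(1.0 / (delta * n))


def sub_gaussian_radius(sigma, n, delta):
    """Deviation radius attained by a sub-Gaussian mean estimator."""
    return sigma * math.sqrt(2.0 * math.log(2.0 / delta) / n)


def catoni_lower_radius(sigma, n, delta):
    """Radius in Catoni's lower bound for the sample mean; needs delta in (0, 1/e)."""
    if not 0.0 < delta < math.exp(-1.0):
        raise ValueError("delta must lie in (0, 1/e)")
    shrink = (1.0 - math.e * delta / n) ** ((n - 1) / 2)
    return chebyshev_radius(sigma, n, delta) * shrink


sigma, n = 1.0, 100  # illustrative values
for delta in (0.05, 0.01, 0.001):
    print(
        f"delta={delta}: Chebyshev={chebyshev_radius(sigma, n, delta):.3f}, "
        f"Catoni lower={catoni_lower_radius(sigma, n, delta):.3f}, "
        f"sub-Gaussian={sub_gaussian_radius(sigma, n, delta):.3f}"
    )
```

For small $\delta$ the Chebyshev radius grows like $\delta^{-1/2}$ while the sub-Gaussian radius grows only like $\sqrt{\log(1/\delta)}$, and the lower-bound radius stays close to the Chebyshev one, which is the sense in which Chebyshev's bound is essentially sharp for the sample mean.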

All the aforementioned work studies robust estimation through concentration properties, that is, the robust estimator is tightly concentrated around the true parameter with high probability even when the sampling distribution has only a small number of finite moments. In general, concentration inequalities lose constant factors and may result in confidence intervals too wide to be informative. Therefore, an interesting and challenging open problem is how to construct tight confidence sets for $\bm{\theta}^{*}$ with finite samples of heavy-tailed data (Devroye et al., 2016).

This paper addresses this open problem by developing a robust inference framework with non-asymptotic guarantees. To illustrate the key idea, we focus on the classical setting where the parameter dimension $d$ is smaller than, but allowed to increase with, the number of observations $n$. Our approach integrates concentration properties of the adaptive Huber estimator (see Theorems 2.1 and 2.2) and the multiplier bootstrap method. The multiplier bootstrap, also known as the weighted bootstrap, is one of the most widely used resampling methods for constructing a confidence interval/set or for measuring the significance of a test. Its theoretical validity is typically guaranteed by the multiplier central limit theorem (van der Vaart and Wellner, 1996). We refer to Chatterjee and Bose (2005), Arlot, Blanchard and Roquain (2010), Chernozhukov, Chetverikov and Kato (2013, 2014), Spokoiny and Zhilova (2015) and Zhilova (2016) for the most recent progress in the theory and applications of the multiplier bootstrap. In particular, Spokoiny and Zhilova (2015) considered a multiplier bootstrap procedure for constructing likelihood-based confidence sets under a possible model misspecification. For a linear model with sub-Gaussian errors, their results validate the bootstrap procedure when applied to ordinary least squares (OLS). With heavy-tailed errors in the regression model (1.1), we demonstrate how the adaptive Huber regression and the multiplier bootstrap can be integrated to construct robust and sharp confidence sets for the true parameter $\bm{\theta}^{*}$ with a given coverage probability. The validity of the bootstrap procedure in situations with a limited sample size, growing dimensionality and heavy-tailed errors is established. In all these theoretical results, we provide non-asymptotic bounds for the errors of bootstrap approximation. See Theorems 2.3 and 2.4 for finite sample properties of the bootstrap adaptive Huber estimator, including the deviation inequality, Bahadur representation and Wilks’ expansion.

An alternative robust inference method is based on the asymptotic theory developed in Zhou et al. (2018); see, for example, Theorems 2.2 and 2.3 therein. Since the asymptotic distribution of either the proposed robust estimator itself or the excess risk depends on $\sigma^{2}$ , a direct approach is to replace $\sigma^{2}$ by some sub-Gaussian variance estimator using Catoni’s method (Catoni, 2012) or the median-of-means technique (Minsker, 2015), with the advantage of being computationally fast. The disadvantage, however, is two-fold: first, constructing a sub-Gaussian variance estimator involves another tuning parameter (for the problem of simultaneously testing $m$ regression models as discussed in the next section, variance estimation brings $m$ additional tuning parameters); second, because squared heavy-tailed data are highly right-skewed, using the method in Catoni (2012) or Fan, Li and Wang (2017) tends to underestimate the variance, and the median-of-means method is numerically unstable for small or moderate samples. Both methods were examined numerically in Zhou et al. (2018), while the multiplier bootstrap procedure, albeit more computationally intensive, demonstrates the most desirable finite sample performance.

1.2 Simultaneous inference

In addition to building confidence sets for an individual parameter vector, multiple hypothesis testing is another important statistical problem with applications to many scientific fields, where thousands of tests are performed simultaneously (Dudoit and van der Laan, 2008; Efron, 2010). Gene microarrays comprise a prototypical example; there, each subject is automatically measured on tens of thousands of features. The large number of tests, together with heavy tailedness, brings new challenges to conventional statistical methods, which, in this scenario, often suffer from low power to detect important features and poor reproducibility. Robust alternatives are thus needed for conducting large-scale multiple inference for heavy-tailed data.

In this section, we consider the multiple response regression model

[TABLE]

where $\mu_{k}$ is the intercept, $\bm{x}_{i}=(x_{i1},\ldots,x_{is})^{\intercal},\bm{\beta}_{k}=(\beta_{k1},\ldots,\beta_{ks})^{\intercal}\in\mathbb{R}^{s}$ are $s$ -dimensional vectors of random covariates and regression coefficients, respectively, and $\varepsilon_{ik}$ is the regression error. Since our main focus here is the inference for intercepts, we decompose the parameter vector $\bm{\theta}^{*}$ in (1.1) into two parts: the intercept $\mu_{k}$ and the slope $\bm{\beta}_{k}$ . Moreover, we use $\bm{x}_{i}$ in (1.2) to distinguish from $\bm{X}_{i}$ in (1.1). Write $\bm{y}_{i}=(y_{i1},\ldots,y_{im})^{\intercal}\in\mathbb{R}^{m}$ and let $\bm{\mu}=(\mu_{1},\ldots,\mu_{m})^{\intercal}\in\mathbb{R}^{m}$ be the vector of intercepts. Based on random samples $\{(\bm{y}_{i},\bm{x}_{i})\}_{i=1}^{n}$ from model (1.2), our goal is to simultaneously test the hypotheses

[TABLE]

An iconic example of model (1.2) is the linear pricing model, which subsumes the capital asset pricing model (CAPM) (Sharpe, 1964; Lintner, 1965) and the Fama-French three-factor model (Fama and French, 1993). The key implication from the multi-factor pricing theory is that for any asset $k$ , the intercept $\mu_{k}$ should be zero. It is then important to investigate if such a pricing theory, also known as the “mean-variance efficiency” pricing, can be validated by empirical data (Fan, Liao and Yao, 2015). According to the Berk and Green equilibrium (Berk and Green, 2004), inefficient pricing by the market may occur to a small proportion of exceptional assets, namely a very small fraction of the $\mu_{k}$ ’s are nonzero. To identify positive $\mu_{k}$ ’s by testing a large number of hypotheses simultaneously, Barras, Scaillet and Wermers (2010) and Lan and Du (2019) developed FDR controlling procedures for data coming from model (1.2), which can be applied to mutual fund selection in empirical finance. We refer to Friguet, Kloareg and Causeur (2009), Desai and Storey (2012), Fan, Han and Gu (2012) and Wang et al. (2017) for more examples from gene expression studies, where the goal is to identify features showing a biological signal of interest.

Despite the extensive research and wide application of this problem, existing least squares-based methods with normal calibration could fail when applied to heavy-tailed data with a small sample size. To address this challenge, we develop a robust bootstrap procedure for large-scale simultaneous inference, which achieves good numerical performance for a small or moderate sample size. Theoretically, we prove its validity in controlling the false discovery proportion (FDP) (see Theorem 4.1).

Finally, we briefly comment on the computational issue. Fast computation of Huber regression is critical to our procedure since the multiplier bootstrap requires solving the Huber loss minimization at least hundreds of times. Ideally, a second order approach (e.g. Newton’s method) is preferred. However, the second order derivative of the Huber loss does not exist everywhere. To address this issue, we adopt the damped semismooth Newton method (Qi and Sun, 1999), which is a synergistic integration of first and second order methods. The details are provided in Appendix D of the supplemental material.

1.3 Organization of the paper

The rest of the paper proceeds as follows. Section 2.1 presents a series of finite sample results of the adaptive Huber regression. Sections 2.2 and 2.3 contain, respectively, the description of the bootstrap procedure for building confidence sets and theoretical guarantees. Two data-driven schemes are proposed in Section 3 for choosing the tuning parameter in the Huber loss. In Section 4, we propose a robust bootstrap calibration method for multiple testing and investigate its theoretical property on controlling the FDP. The conclusions that are drawn in Sections 2 and 4 are illustrated numerically in Section 5. We conclude with a discussion in Section 6. The supplementary material contains all the proofs and additional simulation studies.

1.4 Notation

Let us summarize our notation. For every integer $k\geq 1$ , we use $\mathbb{R}^{k}$ to denote the $k$ -dimensional Euclidean space. The inner product of any two vectors $\bm{u}=(u_{1},\ldots,u_{k})^{\intercal},\bm{v}=(v_{1},\ldots,v_{k})^{\intercal}\in\mathbb{R}^{k}$ is defined by $\bm{u}^{\intercal}\bm{v}=\langle\bm{u},\bm{v}\rangle=\sum_{i=1}^{k}u_{i}v_{i}$ . We use the notation $\|\cdot\|_{p},1\leq p\leq\infty$ for the $\ell_{p}$ -norms of vectors in $\mathbb{R}^{k}$ : $\|\bm{u}\|_{p}=(\sum_{i=1}^{k}|u_{i}|^{p})^{1/p}$ and $\|\bm{u}\|_{\infty}=\max_{1\leq i\leq k}|u_{i}|$ . For $k\geq 2$ , $\mathbb{S}^{k-1}=\{\bm{u}\in\mathbb{R}^{k}:\|\bm{u}\|_{2}=1\}$ denotes the unit sphere in $\mathbb{R}^{k}$ . Throughout this paper, we use bold capital letters to represent matrices. For $k\geq 2$ , $\bm{I}_{k}$ represents the identity/unit matrix of size $k$ . For any $k\times k$ symmetric matrix $\bm{A}\in\mathbb{R}^{k\times k}$ , $\|\bm{A}\|_{2}$ is the operator norm of $\bm{A}$ . We use $\overline{\lambda}_{\bm{A}}$ and $\underline{\lambda}_{\bm{A}}$ to denote the largest and smallest eigenvalues of $\bm{A}$ , respectively. For any two real numbers $u$ and $v$ , we write $u\vee v=\max(u,v)$ and $u\wedge v=\min(u,v)$ . For two sequences of non-negative numbers $\{a_{n}\}_{n\geq 1}$ and $\{b_{n}\}_{n\geq 1}$ , $a_{n}\lesssim b_{n}$ indicates that there exists a constant $C>0$ independent of $n$ such that $a_{n}\leq Cb_{n}$ ; $a_{n}\gtrsim b_{n}$ is equivalent to $b_{n}\lesssim a_{n}$ ; $a_{n}\asymp b_{n}$ is equivalent to $a_{n}\lesssim b_{n}$ and $b_{n}\lesssim a_{n}$ . For two numbers $C_{1}$ and $C_{2}$ , we write $C_{2}=C_{2}(C_{1})$ if $C_{2}$ depends only on $C_{1}$ . For any set $\mathcal{S}$ , we use ${\rm card}(\mathcal{S})$ and $|\mathcal{S}|$ to denote its cardinality, i.e. the number of elements in $\mathcal{S}$ .

2 Robust bootstrap confidence sets

2.1 Preliminaries

First, we present some finite sample properties of the adaptive Huber estimator, which are of independent interest and also sharpen the results in Sun et al. (2018).

Let us recall the definition of the Huber loss.

Definition 2.1.

The Huber loss $\ell_{\tau}(\cdot)$ (Huber, 1964) is defined as

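$$\ell_{\tau}(u)=\begin{cases}u^{2}/2, & \text{if }|u|\leq\tau,\\ \tau|u|-\tau^{2}/2, & \text{if }|u|>\tau,\end{cases}$$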

where $\tau>0$ is a tuning parameter and will be referred to as the robustification parameter that balances bias and robustness.
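For concreteness, the Huber loss and its derivative can be written in a few lines. This is a minimal sketch using the standard form of the loss (quadratic for $|u|\leq\tau$ , linear beyond); the function names `huber_loss` and `huber_score` are illustrative and not taken from the paper.

```python
def huber_loss(u: float, tau: float) -> float:
    """Huber loss: u^2/2 on [-tau, tau], tau*|u| - tau^2/2 outside."""
    if abs(u) <= tau:
        return 0.5 * u * u
    return tau * abs(u) - 0.5 * tau * tau


def huber_score(u: float, tau: float) -> float:
    """Derivative of the Huber loss: u clipped to the interval [-tau, tau]."""
    return max(-tau, min(u, tau))
```

A small $\tau$ clips more residuals (more robustness, more bias), whereas $\tau\to\infty$ recovers the squared loss.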

The Huber estimator is defined as

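$$\widehat{\bm{\theta}}_{\tau}\in\mathop{\mathrm{argmin}}_{\bm{\theta}\in\mathbb{R}^{d}}\mathcal{L}_{\tau}(\bm{\theta})\quad\text{with}\quad\mathcal{L}_{\tau}(\bm{\theta}):=\sum_{i=1}^{n}\ell_{\tau}(Y_{i}-\langle\bm{X}_{i},\bm{\theta}\rangle).\tag{2.4}$$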

The following theorem provides a sub-Gaussian-type deviation inequality and a non-asymptotic Bahadur representation for $\widehat{\bm{\theta}}_{\tau}$ . The proof is given in the supplement. We first impose the moment conditions.

Assumption 1.

(i) There exists some constant $A_{0}>0$ such that for any $\bm{u}\in\mathbb{R}^{d}$ and $t\in\mathbb{R}$ , $\mathbb{P}(|\langle\bm{u},\bm{Z}\rangle|\geq A_{0}\|\bm{u}\|_{2}\cdot t)\leq 2\exp(-t^{2})$ , where $\bm{Z}=\bm{\Sigma}^{-1/2}\bm{X}$ and $\bm{\Sigma}=\mathbb{E}(\bm{X}\bm{X}^{\intercal})$ is positive definite. (ii) The regression error $\varepsilon$ satisfies $\mathbb{E}(\varepsilon|\bm{X})=0$ , $\mathbb{E}(\varepsilon^{2}|\bm{X})=\sigma^{2}$ and $\mathbb{E}(|\varepsilon|^{2+\delta}|\bm{X})\leq\upsilon_{2+\delta}$ almost surely for some $\delta\geq 0$ .

Part (i) of Condition 1 requires $\bm{X}$ to be a sub-Gaussian vector. Via one-dimensional marginals, this generalizes the concept of sub-Gaussian random variables to higher dimensions. Typical examples include: (i) Gaussian and Bernoulli random vectors, (ii) spherical random vectors (a random vector $\bm{X}\in\mathbb{R}^{d}$ is said to have a spherical distribution if it is uniformly distributed on the Euclidean sphere in $\mathbb{R}^{d}$ with center at the origin and radius $\sqrt{d}$ ), (iii) random vectors that are uniformly distributed on the Euclidean ball centered at the origin with radius $\sqrt{d}$ , and (iv) random vectors that are uniformly distributed on the unit cube $[-1,1]^{d}$ . In all the above cases, the constant $A_{0}$ represents a dimension-free constant. We refer to Chapter 3.4 in Vershynin (2018) for detailed discussions of sub-Gaussian distributions in higher dimensions. Technically, this assumption is needed in order to derive an exponential-type concentration inequality for the quadratic form $\|\sum_{i=1}^{n}\ell^{\prime}_{\tau}(\varepsilon_{i})\bm{Z}_{i}\|_{2}$ , where

$$\ell^{\prime}_{\tau}(u)=\min\{\max(-\tau,u),\tau\},\quad u\in\mathbb{R}.$$

Theorem 2.1.

Assume Condition 1 holds. For any $t>0$ and $v\geq\upsilon_{2+\delta}^{1/(2+\delta)}$ , the estimator $\widehat{\bm{\theta}}_{\tau}$ given in (2.4) with $\tau=v(\frac{n}{d+t})^{1/(2+\delta)}$ satisfies

[TABLE]

as long as $n\geq c_{3}(d+t)$ , where $c_{1}$ – $c_{3}$ are constants depending only on $A_{0}$ .

The non-asymptotic results in Theorem 2.1 reveal a new perspective for Huber’s method: to construct sub-Gaussian estimators for linear regression with heavy-tailed errors, the tuning parameter in the Huber loss should adapt to the sample size, dimension and moments for optimal tradeoff between bias and robustness. The resulting estimator is therefore referred to as the adaptive Huber estimator. Specifically, Theorem 2.1 provides the concentration property of the adaptive Huber estimator $\widehat{\bm{\theta}}_{\tau}$ and the Fisher expansion for the difference $\widehat{\bm{\theta}}_{\tau}-\bm{\theta}^{*}$ . It improves Theorem 2.1 in Zhou et al. (2018) by sharpening the sample size scaling. The classical asymptotic results can be easily derived from the obtained non-asymptotic expansions. In the following theorem, we further study the concentration property of the Wilks’ expansion for the excess $\mathcal{L}_{\tau}(\bm{\theta}^{*})-\mathcal{L}_{\tau}(\widehat{\bm{\theta}}_{\tau})$ . This new result is directly related to the construction of confidence sets. See Theorem 2.3 below for its counterpart in the bootstrap world.
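The adaptive choice of $\tau$ in Theorem 2.1 can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: the solver is plain iteratively reweighted least squares (IRLS) rather than the semismooth Newton method mentioned in Section 1.2, and the names `adaptive_tau`, `huber_regression` and `simulate`, as well as the simulated $t_{3}$ errors, are assumptions made for illustration.

```python
import numpy as np


def adaptive_tau(v, n, d, t, delta=0.0):
    """tau = v * (n / (d + t))^{1/(2+delta)}, the scaling in Theorem 2.1."""
    return v * (n / (d + t)) ** (1.0 / (2.0 + delta))


def huber_regression(X, y, tau, max_iter=500, tol=1e-10):
    """Minimize sum_i l_tau(y_i - <x_i, theta>) by IRLS (illustrative solver)."""
    theta = np.linalg.lstsq(X, y, rcond=None)[0]  # OLS starting point
    for _ in range(max_iter):
        r = y - X @ theta
        # IRLS weight: 1 on the quadratic part, tau/|r| on the linear part.
        omega = np.minimum(1.0, tau / np.maximum(np.abs(r), 1e-12))
        sw = np.sqrt(omega)
        new = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        converged = np.linalg.norm(new - theta) <= tol * (1.0 + np.linalg.norm(theta))
        theta = new
        if converged:
            break
    return theta


def simulate(n=500, d=3, seed=0):
    """Hypothetical design: Gaussian covariates, t_3 errors (finite variance only)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    theta_star = np.arange(1.0, d + 1.0)
    y = X @ theta_star + rng.standard_t(3, size=n)
    return X, y, theta_star
```

With `X, y, theta_star = simulate()`, one may take `tau = adaptive_tau(np.sqrt(3.0), 500, 3, np.log(500))`, since the $t_{3}$ distribution has standard deviation $\sqrt{3}$ , and compare `huber_regression(X, y, tau)` with `theta_star`.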

Theorem 2.2.

Assume Condition 1 holds. Then for any $t>0$ and $v\geq\upsilon_{2+\delta}^{1/(2+\delta)}$ , the estimator $\widehat{\bm{\theta}}_{\tau}$ with $\tau=v(\frac{n}{d+t})^{1/(2+\delta)}$ satisfies that with probability at least $1-3e^{-t}$ ,

[TABLE]

as long as $n\geq c_{3}(d+t)$ , where $c_{4},c_{5}>0$ are constants depending on $A_{0}$ .

Remark 1 (On the robustification parameter $\tau$ ).

Going through the proofs of Theorems 2.1 and 2.2, we see that the robustification parameter $\tau$ can be chosen as

[TABLE]

such that the conclusions (2.6)–(2.9) hold as long as $n\gtrsim d+t$ . This implies that the existence of higher moments of $\varepsilon$ increases the flexibility of choosing $\tau$ , whose order ranges from $(\frac{n}{d+t})^{1/(2+\delta)}$ to $(\frac{n}{d+t})^{1/2}$ . In practice, $\upsilon_{2+\delta}$ is unknown and thus brings difficulty in calibrating $\tau$ . Guided by the theoretical results, in Section 3 we propose a data-dependent procedure to choose $\tau$ .

Remark 2 (Sample size scaling).

The deviation inequalities in Theorems 2.1 and 2.2 hold under the scaling condition $n\gtrsim d+t$ , indicating that as many as $d+t$ samples are required to guarantee the finite sample properties of the estimator. Similar conditions are also imposed for Proposition 2.4 in Catoni (2012) and Theorem 3.1 in Audibert and Catoni (2011). In particular if $\mathbb{E}(\varepsilon^{2})<\infty$ , taking $t=\log n$ and $\tau\asymp(\frac{n}{d+t})^{1/2}$ , the corresponding estimator $\widehat{\bm{\theta}}_{\tau}$ satisfies

[TABLE]

with probability at least $1-O(n^{-1})$ under the scaling $n\gtrsim d$ . From an asymptotic viewpoint, this implies that if the dimension $d$ , as a function of $n$ , satisfies $d=o(n)$ as $n\to\infty$ , then for any deterministic vector $\bm{u}\in\mathbb{R}^{d}$ , the distribution of the linear contrast $\bm{u}^{\intercal}(\widehat{\bm{\theta}}_{\tau}-\bm{\theta}^{*})$ coincides with that of $(1/n)\sum_{i=1}^{n}\ell^{\prime}_{\tau}(\varepsilon_{i})\bm{u}^{\intercal}\bm{\Sigma}^{-1}\bm{X}_{i}$ asymptotically.

Remark 3.

To achieve sub-Gaussian behavior, the choice of loss function is not unique. An alternative loss function, which is obtained from the influence function proposed by Catoni and Giulini (2017), is

[TABLE]

The function $\rho_{\tau}$ is convex, twice differentiable everywhere and has a bounded derivative satisfying $|\rho_{\tau}^{\prime}(u)|\leq(2\sqrt{2}/3)\tau$ for all $u$ . By modifying the proofs of Theorems 2.1 and 2.2, it can be shown that the theoretical properties of the adaptive Huber estimator remain valid for the estimator that minimizes the empirical $\rho_{\tau}$ -loss. Computationally, the minimization can be solved via Newton’s method.
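As an illustration of these properties, the sketch below codes one loss of this type. It assumes the influence function $\psi(u)=u-u^{3}/6$ for $|u|\leq\sqrt{2}$ and $\psi(u)=(2\sqrt{2}/3)\,\mathrm{sign}(u)$ otherwise, rescaled by $\tau$ and integrated; this form is an assumption chosen because it reproduces the derivative bound $(2\sqrt{2}/3)\tau$ stated above, and the names `rho` and `rho_prime` are hypothetical.

```python
import math


def rho(u: float, tau: float) -> float:
    """Assumed loss: u^2/2 - u^4/(24 tau^2) near zero, linear in the tails."""
    if abs(u) <= math.sqrt(2.0) * tau:
        return 0.5 * u * u - u ** 4 / (24.0 * tau * tau)
    return (2.0 * math.sqrt(2.0) / 3.0) * tau * abs(u) - 0.5 * tau * tau


def rho_prime(u: float, tau: float) -> float:
    """Derivative of rho; bounded in absolute value by (2*sqrt(2)/3)*tau."""
    if abs(u) <= math.sqrt(2.0) * tau:
        return u - u ** 3 / (6.0 * tau * tau)
    return math.copysign((2.0 * math.sqrt(2.0) / 3.0) * tau, u)
```

At the junction $|u|=\sqrt{2}\tau$ the two branches agree in value ( $5\tau^{2}/6$ ), in first derivative ( $(2\sqrt{2}/3)\tau$ ) and in second derivative (zero), which is what makes a plain Newton iteration applicable.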

2.2 Multiplier bootstrap

In this section, we go beyond estimation and focus on robust inference. According to (2.9), the distribution of $2\{\mathcal{L}_{\tau}(\bm{\theta}^{*})-\mathcal{L}_{\tau}(\widehat{\bm{\theta}}_{\tau})\}$ is close to that of $(1/n)\|\sum_{i=1}^{n}\xi_{i}\bm{Z}_{i}\|_{2}^{2}$ provided that $d^{2}/n$ is small, where $\xi_{i}=\ell_{\tau}^{\prime}(\varepsilon_{i})$ . As we will see in the proof of Theorem 2.5, the truncated random variable $\xi_{i}$ has mean and variance approximately equal to $0$ and $\sigma^{2}$ , respectively. Heuristically, the multivariate central limit theorem allows us to approximate the distribution of the normalized sum $n^{-1/2}\sum_{i=1}^{n}\xi_{i}\bm{Z}_{i}$ by $\mathcal{N}(\textbf{0},\sigma^{2}\bm{I}_{d})$ . If this were true, then the distribution of $2\{\mathcal{L}_{\tau}(\bm{\theta}^{*})-\mathcal{L}_{\tau}(\widehat{\bm{\theta}}_{\tau})\}$ would be close to the scaled chi-squared distribution $\sigma^{2}\chi_{d}^{2}$ with $d$ degrees of freedom. This is in line with the asymptotic behavior of the likelihood ratio statistic that was studied in Wilks (1938). With a sufficiently large sample size, this result allows us to construct confidence sets for $\bm{\theta}^{*}$ using quantiles of $\chi_{d}^{2}$ : for any $\alpha\in(0,1)$ ,

[TABLE]

where $\chi^{2}_{d,\alpha}$ denotes the upper $\alpha$ -quantile of $\chi_{d}^{2}$ . Estimating the residual variance $\sigma^{2}$ in the construction of $\mathcal{C}_{\alpha}^{*}(\sigma)$ is even more challenging when the errors are heavy-tailed. Moreover, as argued in Spokoiny and Zhilova (2015), a possibly low speed of convergence of the likelihood ratio statistic makes the asymptotic Wilks’ result hardly applicable to the case of small or moderate samples. Motivated by these two concerns, we have the following goal:

[TABLE]

The results in Section 2.1 show that the adaptive Huber estimator provides a robust estimate of $\bm{\theta}^{*}$ in the sense that it admits sub-Gaussian-type deviations when the error distribution only has finite variance. To estimate the quantiles of the adaptive Huber estimator and to construct confidence sets, we consider the use of the multiplier bootstrap. Let $U_{1},\ldots,U_{n}$ be independent and identically distributed (IID) random variables that are independent of the observed data $\mathcal{D}_{n}:=\{(Y_{i},\bm{X}_{i})\}_{i=1}^{n}$ and satisfy

[TABLE]

With $W_{i}:=1+U_{i}$ denoting the random weights, the bootstrap Huber loss and bootstrap Huber estimator are defined, respectively, as

[TABLE]

where $R>0$ is a prespecified radius parameter. A simple observation is that $\mathbb{E}^{*}\{\mathcal{L}_{\tau}^{\mathrm{\tiny{b}}}(\bm{\theta})\}=\mathcal{L}_{\tau}(\bm{\theta})$ , where $\mathbb{E}^{*}(\cdot):=\mathbb{E}(\cdot\,|\mathcal{D}_{n})$ is the conditional expectation given the observed data $\mathcal{D}_{n}$ . Therefore, $\widehat{\bm{\theta}}_{\tau}\in\mathop{\mathrm{argmin}}_{\bm{\theta}\in\mathbb{R}^{d}}\mathbb{E}^{*}\{\mathcal{L}_{\tau}^{\mathrm{\tiny{b}}}(\bm{\theta})\}$ and the difference $\mathcal{L}_{\tau}^{\mathrm{\tiny{b}}}(\widehat{\bm{\theta}}_{\tau})-\mathcal{L}_{\tau}^{\mathrm{\tiny{b}}}(\widehat{\bm{\theta}}_{\tau}^{\mathrm{\tiny{b}}})$ mimics $\mathcal{L}_{\tau}(\bm{\theta}^{*})-\mathcal{L}_{\tau}(\widehat{\bm{\theta}}_{\tau})$ .

Based on this idea, we propose a Huber regression based inference procedure in Algorithm 1, where the bootstrap threshold $z^{\mathrm{\tiny{b}}}_{\alpha}=z^{\mathrm{\tiny{b}}}_{\alpha}(\mathcal{D}_{n})$ approximates

[TABLE]

Here $\mathbb{P}$ is the probability measure with respect to the underlying data generating process.
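The bootstrap calibration described in this section can be sketched as follows. This is a hedged illustration under stated assumptions rather than a transcription of Algorithm 1: the multipliers are Rademacher ( $U_{i}\in\{-1,1\}$ , mean zero and unit variance, so that $W_{i}\in\{0,2\}$ ), a choice made here so that the weighted loss stays convex and the radius constraint $R$ can be dropped; the IRLS solver, the function names and the default number of bootstrap draws are likewise illustrative.

```python
import numpy as np


def huber_loss_sum(theta, X, y, tau, w=None):
    """(Weighted) Huber loss summed over the sample."""
    r = y - X @ theta
    a = np.abs(r)
    loss = np.where(a <= tau, 0.5 * r ** 2, tau * a - 0.5 * tau ** 2)
    return float(np.sum(loss if w is None else w * loss))


def weighted_huber_fit(X, y, tau, w, init, max_iter=500, tol=1e-10):
    """IRLS for sum_i w_i * l_tau(y_i - <x_i, theta>) with non-negative w_i."""
    theta = np.array(init, dtype=float)
    for _ in range(max_iter):
        r = y - X @ theta
        omega = w * np.minimum(1.0, tau / np.maximum(np.abs(r), 1e-12))
        sw = np.sqrt(omega)
        new = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        converged = np.linalg.norm(new - theta) <= tol * (1.0 + np.linalg.norm(theta))
        theta = new
        if converged:
            break
    return theta


def bootstrap_threshold(X, y, tau, alpha=0.05, n_boot=200, seed=0):
    """Return (theta_hat, z_alpha, stats); z_alpha is the upper alpha-quantile
    of L^b(theta_hat) - L^b(theta_hat^b) over the bootstrap draws."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    ols = np.linalg.lstsq(X, y, rcond=None)[0]
    theta_hat = weighted_huber_fit(X, y, tau, np.ones(n), ols)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        w = 1.0 + rng.choice([-1.0, 1.0], size=n)  # W_i = 1 + U_i, U_i Rademacher
        theta_b = weighted_huber_fit(X, y, tau, w, theta_hat)
        stats[b] = (huber_loss_sum(theta_hat, X, y, tau, w)
                    - huber_loss_sum(theta_b, X, y, tau, w))
    return theta_hat, float(np.quantile(stats, 1.0 - alpha)), stats


def in_confidence_set(theta, theta_hat, z_alpha, X, y, tau):
    """Check whether L(theta) - L(theta_hat) <= z_alpha."""
    return huber_loss_sum(theta, X, y, tau) - huber_loss_sum(theta_hat, X, y, tau) <= z_alpha
```

Each bootstrap fit is started at $\widehat{\bm{\theta}}_{\tau}$ ; because IRLS never increases the weighted loss, every bootstrap statistic is non-negative up to rounding. With Gaussian multipliers, as required in Theorem 2.5, some weights are negative and the constrained formulation with radius $R$ would be needed instead.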

2.3 Theoretical results

In this section, we present detailed theoretical results for the bootstrap adaptive Huber estimator, including the deviation inequality, non-asymptotic Bahadur representation (Theorem 2.3), and Wilks’ expansions (Theorem 2.4). Moreover, Theorems 2.5 and 2.6 establish the validity of the multiplier bootstrap for estimating quantiles of $\mathcal{L}_{\tau}(\bm{\theta}^{*})-\mathcal{L}_{\tau}(\widehat{\bm{\theta}}_{\tau})$ when the variance $\sigma^{2}$ is unknown. Proofs of the finite sample properties of the bootstrap estimator require new techniques and are more involved than those of Theorems 2.1 and 2.2. We leave them to the supplemental material.

Assumption 2.

$U_{1},\ldots,U_{n}$ are IID from a random variable $U$ satisfying $\mathbb{E}(U)=0$ , $\textnormal{var}(U)=1$ and $\mathbb{P}(|U|\geq t)\leq 2\exp(-t^{2}/A_{U}^{2})$ for all $t\geq 0$ .

Theorem 2.3.

Assume Condition 1 with $\delta=2$ and Condition 2 hold. For any $t>0$ and $v\geq\upsilon_{4}^{1/4}$ , the estimator $\widehat{\bm{\theta}}^{\mathrm{\tiny{b}}}_{\tau}$ with $\tau=v(\frac{n}{d+t})^{1/4}$ and $R\asymp v$ satisfies:

1. with probability (over $\mathcal{D}_{n}$ ) at least $1-5e^{-t}$ ,

[TABLE]

2. with probability (over $\mathcal{D}_{n}$ ) at least $1-6e^{-t}$ ,

[TABLE]

as long as $n\geq\max\{c_{3}\kappa_{\bm{\Sigma}}(d+t),c_{4}(d+t)^{2}\}$ , where $c_{1}$ – $c_{3}$ are positive constants depending on $(A_{0},A_{U})$ , $c_{4}=c_{4}(A_{0})>0$ and $\kappa_{\bm{\Sigma}}=\overline{\lambda}_{\bm{\Sigma}}/\underline{\lambda}_{\bm{\Sigma}}$ is the condition number of $\bm{\Sigma}$ .

The following theorem is a bootstrap version of Theorem 2.2. Define the random process

[TABLE]

From (2.5) and (2.15) we see that

[TABLE]

In particular, write $\bm{\xi}^{\mathrm{\tiny{b}}}=\bm{\xi}^{\mathrm{\tiny{b}}}(\bm{\theta}^{*})=-\sum_{i=1}^{n}\ell^{\prime}_{\tau}(\varepsilon_{i})U_{i}\bm{Z}_{i}$ .

Theorem 2.4.

Assume Condition 1 with $\delta=2$ and Condition 2 hold. For any $t>0$ and $v\geq\upsilon_{4}^{1/4}$ , the bootstrap estimator $\widehat{\bm{\theta}}^{\mathrm{\tiny{b}}}_{\tau}$ with $\tau=v(\frac{n}{d+t})^{1/4}$ and $R\asymp v$ satisfies that, with probability (over $\mathcal{D}_{n}$ ) at least $1-5e^{-t}$ ,

[TABLE]

and

[TABLE]

as long as $n\geq\max\{c_{3}\kappa_{\bm{\Sigma}}(d+t),c_{4}(d+t)^{2}\}$ , where $c_{5},c_{6}>0$ are constants depending only on $(A_{0},A_{U})$ .

The results (2.21) and (2.22) are non-asymptotic bootstrap versions of the Wilks’ and square-root Wilks’ phenomena. In particular, the latter indicates that the square-root excess $\sqrt{2\{\mathcal{L}^{\mathrm{\tiny{b}}}_{\tau}(\widehat{\bm{\theta}}_{\tau})-\mathcal{L}^{\mathrm{\tiny{b}}}_{\tau}(\widehat{\bm{\theta}}_{\tau}^{\mathrm{\tiny{b}}})\}}$ is close to $n^{-1/2}\|\bm{\xi}^{\mathrm{\tiny{b}}}\|_{2}$ with high probability as long as the dimension $d$ of the parameter space satisfies the condition that $d^{2}/n$ is small.

Remark 4 (Order of robustification parameter).

Similar to Remark 1, now with finite fourth moment $\upsilon_{4}$ , the robustification parameter in Theorems 2.3 and 2.4 can be chosen as

[TABLE]

such that the same conclusions remain valid. Due to Lemma A.2 in the supplemental material, here we require $\eta$ to be strictly less than $1/2$ .

The next result validates the approximation of the distribution of $\mathcal{L}_{\tau}(\bm{\theta}^{*})-\mathcal{L}_{\tau}(\widehat{\bm{\theta}}_{\tau})$ by that of $\mathcal{L}^{\mathrm{\tiny{b}}}_{\tau}(\widehat{\bm{\theta}}_{\tau})-\mathcal{L}^{\mathrm{\tiny{b}}}_{\tau}(\widehat{\bm{\theta}}_{\tau}^{\mathrm{\tiny{b}}})$ in the Kolmogorov distance. Recall that $\mathbb{P}^{*}(\cdot)=\mathbb{P}(\cdot\,|\mathcal{D}_{n})$ denotes the conditional probability given the observed data $\mathcal{D}_{n}=\{(Y_{i},\bm{X}_{i})\}_{i=1}^{n}$ .

Theorem 2.5.

Suppose Condition 1 holds with $\delta=2$ and Condition 2 holds with $U\sim\mathcal{N}(0,1)$ . For any $t>0$ and $v\geq\upsilon_{4}^{1/4}$ , let $\tau=v(\frac{n}{d+t})^{\eta}$ for some $\eta\in[1/4,1/2)$ . Then, with probability (over $\mathcal{D}_{n}$ ) at least $1-6e^{-t}$ , it holds for any $z\geq 0$ that

[TABLE]

where

[TABLE]

with $C=C(A_{0},\sigma,\upsilon_{4},v)>0$ .

Theorem 2.5 is in parallel with and can be viewed as a partial extension of Theorem 2.1 in Spokoiny and Zhilova (2015) to the case of heavy-tailed errors. In particular, taking $\eta=1/4$ in Theorem 2.5 we see that the error term scales as $(d^{3}/n)^{1/4}$ , while in Spokoiny and Zhilova (2015) it is of order $(d^{3}/n)^{1/8}$ . The difference is due to the fact that the latter allows misspecified models as discussed in Remark A.2 therein. In some way, allowing asymmetric and heavy-tailed errors can be regarded as a particular form of misspecification, considering that the OLS is the maximum likelihood estimator at the normal model.

Remark 5 (Asymptotic result).

To make asymptotic statements, we assume $n\to\infty$ with an understanding that $d=d(n)$ depends on $n$ and possibly $d\to\infty$ as $n\to\infty$ . Theorem 2.5 can be used to show the bootstrap consistency, where the notion of consistency is the one that guarantees asymptotically valid inference. Specifically, it shows that when the dimension $d$ , as a function of $n$ , satisfies $d=o(n^{1/3})$ , then with $\tau\asymp(\frac{n}{d+\log n})^{\eta}$ for some $\eta\in[1/4,1/2)$ , it holds

[TABLE]

as $n\to\infty$ .

For any $\alpha\in(0,1)$ , let

[TABLE]

be the upper $\alpha$ -quantile of $\mathcal{L}_{\tau}^{\mathrm{\tiny{b}}}(\widehat{\bm{\theta}}_{\tau})-\mathcal{L}_{\tau}^{\mathrm{\tiny{b}}}(\widehat{\bm{\theta}}^{\mathrm{\tiny{b}}}_{\tau})$ under $\mathbb{P}^{*}$ , which serves as an approximation to the target value $z_{\alpha}$ given in (2.17). As a direct consequence of Theorem 2.5, the following result formally establishes the validity of the multiplier bootstrap for adaptive Huber regression with heavy-tailed errors.

Theorem 2.6 (Validity of multiplier bootstrap).

Assume the conditions of Theorem 2.5 hold and take $\eta=1/4$ . Then, for any $\alpha\in(0,1)$ ,

[TABLE]

where $\Delta_{2}(n,d,t)=C\{(d+t)^{3}/n\}^{1/4}+16e^{-t}$ with $C=C(A_{0},\sigma,\upsilon_{4},v)>0$ . In particular, taking $\tau\asymp(\frac{n}{d+\log n})^{1/4}$ , it holds

[TABLE]

provided that $d=d(n)$ satisfies $d=o(n^{1/3})$ as $n\to\infty$ .

3 Data-driven procedures for choosing $\tau$

The theoretical results in Sections 2.1 and 2.3 reveal how the Huber-type estimator performs under various idealized scenarios, thereby providing guidance on the choice of the key tuning parameter, the robustification parameter that balances bias and robustness. For estimation purposes, we take $\tau=v(\frac{n}{d+t})^{1/2}$ with $v\geq\sigma$ ; and for bootstrap inference, we choose $\tau=v(\frac{n}{d+t})^{1/4}$ with $v\geq\upsilon_{4}^{1/4}$ . Since both $\sigma^{2}=\textnormal{var}(\varepsilon)$ and $\upsilon_{4}\geq\mathbb{E}(\varepsilon^{4})$ are typically unknown in practice, an intuitive approach is to replace them by the empirical second and fourth moments of the residuals from the ordinary least squares (OLS) estimator, i.e. $\widehat{\sigma}^{2}:=(n-d)^{-1}\sum_{i=1}^{n}(Y_{i}-\bm{X}_{i}^{\intercal}\widehat{\bm{\theta}}_{{\rm ols}})^{2}$ and $\widehat{\upsilon}_{4}:=(n-d)^{-1}\sum_{i=1}^{n}(Y_{i}-\bm{X}_{i}^{\intercal}\widehat{\bm{\theta}}_{{\rm ols}})^{4}$ . This simple approach performs reasonably well empirically (see Section 5). However, when heavy tails are a concern, $\widehat{\sigma}^{2}$ and $\widehat{\upsilon}_{4}$ are not good estimates of $\sigma^{2}$ and $\upsilon_{4}$ . In this section, we discuss two data-dependent methods for choosing the tuning parameter $\tau$ : the first one uses an adaptive technique based on Lepski’s method (Lepskiĭ, 1991), and the second method is inspired by the censored equation approach in Hahn, Kuelbs and Weiner (1990), which was originally introduced in pursuing a more robust weak convergence theory for self-normalized sums.
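The plug-in approach based on OLS residuals can be written out directly. The sketch below follows the formulas for $\widehat{\sigma}^{2}$ and $\widehat{\upsilon}_{4}$ above; taking $v$ equal to the plug-in value (rather than some larger constant) and the function name `plug_in_taus` are assumptions made for illustration.

```python
import numpy as np


def plug_in_taus(X, y, t):
    """Return (tau_est, tau_boot) from OLS residual moments.

    tau_est  = sigma_hat       * (n / (d + t))^{1/2}  (estimation)
    tau_boot = upsilon4_hat^{1/4} * (n / (d + t))^{1/4}  (bootstrap inference)
    """
    n, d = X.shape
    theta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    res = y - X @ theta_ols
    sigma2_hat = np.sum(res ** 2) / (n - d)      # empirical second moment
    upsilon4_hat = np.sum(res ** 4) / (n - d)    # empirical fourth moment
    ratio = n / (d + t)
    tau_est = np.sqrt(sigma2_hat) * ratio ** 0.5
    tau_boot = upsilon4_hat ** 0.25 * ratio ** 0.25
    return float(tau_est), float(tau_boot)
```

A common choice in the paper's asymptotic statements is $t=\log n$ , i.e. `plug_in_taus(X, y, np.log(len(y)))`.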

3.1 Lepski-type method

Borrowing an idea from Minsker (2018), we first consider a simple adaptive procedure based on Lepski’s method. Let $v_{\min}$ and $v_{\max}$ be some crude preliminary lower and upper bounds for the residual standard deviation, that is, $v_{\min}\leq\sigma\leq v_{\max}$ . For some prespecified $a>1$ , let $v_{j}=v_{\min}a^{j}$ for $j=0,1,\ldots$ and define

[TABLE]

It is easy to see that the set $\mathcal{J}$ has its cardinality bounded by $|\mathcal{J}|\leq 1+\log_{a}(v_{\max}/v_{\min})$ . Accordingly, we define a sequence of candidate parameters $\{\tau_{j}=v_{j}(\frac{n}{d+t})^{1/2},j\in\mathcal{J}\}$ and let $\widehat{\bm{\theta}}^{(j)}$ be the Huber estimator with $\tau=\tau_{j}$ . Set

[TABLE]

for some constant $c_{0}>0$ . The resulting adaptive estimator is then defined as $\widehat{\bm{\theta}}_{{\rm L}}=\widehat{\bm{\theta}}^{(\widehat{j}_{\rm L})}$ .
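The selection step can be sketched in code. The sketch assumes that the rule takes the usual Lepski form: choose the smallest index $j\in\mathcal{J}$ such that $\|\widehat{\bm{\theta}}^{(k)}-\widehat{\bm{\theta}}^{(j)}\|_{2}\leq c_{0}v_{k}\sqrt{(d+t)/n}$ for all $k\in\mathcal{J}$ with $k>j$ , with $\mathcal{J}=\{j:v_{\min}\leq v_{j}<av_{\max}\}$ . The IRLS solver, the default $c_{0}=1$ and the names `huber_fit` and `lepski_select` are illustrative assumptions.

```python
import numpy as np


def huber_fit(X, y, tau, max_iter=500, tol=1e-10):
    """Huber regression by IRLS (illustrative solver)."""
    theta = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(max_iter):
        r = y - X @ theta
        sw = np.sqrt(np.minimum(1.0, tau / np.maximum(np.abs(r), 1e-12)))
        new = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        converged = np.linalg.norm(new - theta) <= tol * (1.0 + np.linalg.norm(theta))
        theta = new
        if converged:
            break
    return theta


def lepski_select(X, y, v_min, v_max, a=2.0, t=None, c0=1.0):
    """Return (j_hat, v_{j_hat}, theta^{(j_hat)}) under the assumed Lepski rule."""
    n, d = X.shape
    t = np.log(n) if t is None else t
    # Candidate grid v_j = v_min * a^j with v_j < a * v_max.
    vs, j = [], 0
    while v_min * a ** j < a * v_max:
        vs.append(v_min * a ** j)
        j += 1
    fits = [huber_fit(X, y, v * np.sqrt(n / (d + t))) for v in vs]
    rate = np.sqrt((d + t) / n)
    for j in range(len(vs)):
        # Accept j if its fit agrees with every fit at a larger index.
        if all(np.linalg.norm(fits[k] - fits[j]) <= c0 * vs[k] * rate
               for k in range(j + 1, len(vs))):
            return j, vs[j], fits[j]
```

The largest index always passes the check vacuously, so the function returns whenever $v_{\min}\leq v_{\max}$ and $a>1$ .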

Theorem 3.1.

Assume that $c_{0}\geq 2c_{1}\underline{\lambda}_{\bm{\Sigma}}^{-1/2}$ for $c_{1}>0$ as in Theorem 2.1. Then for any $t>0$ ,

[TABLE]

with probability at least $1-3\log_{a}(av_{\max}/v_{\min})e^{-t}$ , provided $n\gtrsim d+t$ .

Lepski’s adaptation method serves as a general technique to select the “best” estimator from a collection of certified candidates. The selected estimator adapts to the unknown noise level and satisfies near-optimal probabilistic bounds, while the associated parameter is not necessarily the theoretically optimal one. When applied with the bootstrap, Theorem 2.6 suggests that the dependence on $d/n$ should be slightly adjusted. Since the reuse of the sample brings a big challenge mathematically, we shall prove a theoretical result for the data-driven multiplier bootstrap procedure with sample splitting. However, to avoid notational clutter, we state a two-step procedure without sample splitting, but with the assumption that the second step is carried out on an independent sample.

A Two-Step Data-Driven Multiplier Bootstrap.

Step 1. Given independent observations $\{(Y^{(1)}_{i},\bm{X}^{(1)}_{i})\}_{i=1}^{n}$ from linear model (1.1), first we produce a robust pilot estimator using Lepski’s method. Recall that Lepski’s method requires initial crude upper and lower bounds for $\upsilon_{4}\geq\mathbb{E}(\varepsilon^{4})$ . Let $\mu_{Y}=\mathbb{E}(Y)$ and note that $\upsilon_{Y}:=\mathbb{E}(Y-\mu_{Y})^{4}>\upsilon_{4}$ . We shall use the median-of-means (MOM) estimator of $\upsilon_{Y}$ as a proxy, which is tuning-free in the sense that the construction does not depend on the noise level (Minsker, 2015). Specifically, we divide the index set $\{1,\ldots,n\}$ into $m\geq 2$ disjoint, equal-length groups $G_{1},\ldots,G_{m}$ , assuming $n$ is divisible by $m$ . For $j=1,\ldots,m$ , compute the empirical 4th moment evaluated over observations in group $j$ : $\widehat{\upsilon}_{Y,j}=(1/|G_{j}|)\sum_{i\in G_{j}}\{Y^{(1)}_{i}-\bar{Y}^{(1)}_{G_{j}}\}^{4}$ with $\bar{Y}^{(1)}_{G_{j}}=(1/|G_{j}|)\sum_{i\in G_{j}}Y^{(1)}_{i}$ . The MOM estimator of $\upsilon_{Y}$ is then defined by $\widehat{\upsilon}_{Y,{\rm mom}}={\rm median}\{\widehat{\upsilon}_{Y,1},\ldots,\widehat{\upsilon}_{Y,m}\}$ .

Take $v_{\max}=(2\widehat{\upsilon}_{Y,{\rm mom}})^{1/4}$ and $v_{\min}=a^{-K}v_{\max}$ for some integer $K\geq 1$ and $a>1$ . Denote $v_{j}=a^{j}v_{\min}$ for $j=0,1,\ldots$ , so that $\mathcal{J}=\{j\in\mathbb{Z}:v_{\min}\leq v_{j}<av_{\max}\}=\{0,1,\ldots,K\}$ . Slightly different from above, now we consider a sequence of parameters $\{\tau_{j}=v_{j}(\frac{n}{d+\log n})^{1/4}\}_{j\in\mathcal{J}}$ and let $\widetilde{\bm{\theta}}^{(j)}$ be the Huber estimator with $\tau=\tau_{j}$ . Set

[TABLE]

for some constant $c_{0}>0$ . Denote by $\widehat{\bm{\theta}}^{(1)}=\widetilde{\bm{\theta}}^{(\widetilde{j})}$ the corresponding estimator and put $\widehat{\tau}=\tau_{\widetilde{j}}$ .
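The median-of-means proxy and the candidate grid in Step 1 translate directly into code. This is a minimal sketch: the groups are taken as contiguous blocks of indices (one of several valid ways to form disjoint, equal-length groups), and the function names and default values of $a$ and $K$ are illustrative assumptions.

```python
import math
import statistics


def mom_fourth_moment(y, m):
    """Median of the group-wise centered empirical fourth moments of y."""
    n = len(y)
    if m < 2 or n % m != 0:
        raise ValueError("need m >= 2 and n divisible by m")
    size = n // m
    estimates = []
    for j in range(m):
        group = y[j * size:(j + 1) * size]
        mean = sum(group) / size
        estimates.append(sum((v - mean) ** 4 for v in group) / size)
    return statistics.median(estimates)


def tau_grid(upsilon_mom, n, d, a=2.0, K=5):
    """Candidates tau_j = v_j * (n / (d + log n))^{1/4}, j = 0, ..., K,
    with v_max = (2 * upsilon_mom)^{1/4}, v_min = a^{-K} v_max, v_j = a^j v_min."""
    v_max = (2.0 * upsilon_mom) ** 0.25
    v_min = a ** (-K) * v_max
    scale = (n / (d + math.log(n))) ** 0.25
    return [v_min * a ** j * scale for j in range(K + 1)]
```

For example, `tau_grid(mom_fourth_moment(y, m), n, d)` produces the $K+1$ candidate values among which the Lepski-type rule then selects $\widehat{\tau}$ .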

Step 2. Taking $\widehat{\bm{\theta}}^{(1)}$ and $\widehat{\tau}$ from Step 1, next we apply the multiplier bootstrap procedure to a new sample $\{(Y_{i}^{(2)},\bm{X}^{(2)}_{i})\}_{i=1}^{n}$ that is independent of the previous one. Similarly to (2.4) and (2.16), define

[TABLE]

where $\widehat{\mathcal{L}}(\bm{\theta})=\sum_{i=1}^{n}\ell_{\widehat{\tau}}(Y^{(2)}_{i}-\langle\bm{X}_{i}^{(2)},\bm{\theta}\rangle)$ , $\widehat{\mathcal{L}}^{\mathrm{\tiny{b}}}(\bm{\theta})=\sum_{i=1}^{n}W_{i}\,\ell_{\widehat{\tau}}(Y_{i}^{(2)}-\langle\bm{X}_{i}^{(2)},\bm{\theta}\rangle)$ and $\widehat{R}=\widehat{\tau}(\frac{d+\log n}{n})^{1/4}$ . With the above preparations, we apply Algorithm 1 to construct the confidence set $\widehat{\mathcal{C}}_{\alpha}=\{\bm{\theta}\in\mathbb{R}^{d}:\widehat{\mathcal{L}}(\bm{\theta})-\widehat{\mathcal{L}}(\widehat{\bm{\theta}})\leq\widehat{z}^{\mathrm{\tiny{b}}}_{\alpha}\}$ , where

[TABLE]

with $\bar{\mathcal{D}}_{n}=\{(Y_{i}^{(1)},\bm{X}_{i}^{(1)}),(Y_{i}^{(2)},\bm{X}_{i}^{(2)})\}_{i=1}^{n}$ .

Theorem 3.2.

Assume $\bar{\mathcal{D}}_{n}$ is an independent sample from $(Y,\bm{X})$ satisfying Condition 1 and moreover, $\mathbb{E}(|\varepsilon|^{4+\delta})\leq\upsilon_{4+\delta}$ for some $\delta>0$ . Let $W_{1},\ldots,W_{n}$ be IID $\mathcal{N}(1,1)$ random variables that are independent of $\bar{\mathcal{D}}_{n}$ . Assume further that $d=d(n)$ satisfies $d=o(n^{1/3})$ as $n\to\infty$ . Then, for any $\alpha\in(0,1)$ , the confidence set $\widehat{\mathcal{C}}_{\alpha}$ obtained by the two-step multiplier bootstrap procedure with $m=\lfloor 8\log n+1\rfloor$ and $K\geq\lfloor\log_{a}(3\upsilon_{Y}/\upsilon_{4})^{1/4}\rfloor+1$ satisfies $\mathbb{P}(\bm{\theta}^{*}\in\widehat{\mathcal{C}}_{\alpha})\to 1-\alpha$ as $n\to\infty$ .

The proof of Theorem 3.2 will be provided in Section C.2 in the supplementary material.

3.2 Huber-type method

In Huber’s original proposal, robust location estimation with desirable efficiency also depends on the scale parameter $\sigma$ . For example, in Huber’s Proposal 2 (Huber, 1964), the location $\mu$ and scale $\sigma$ are estimated simultaneously by solving a system of “likelihood equations”. In a similar spirit, we propose a new data-driven tuning scheme to calibrate $\tau$ by solving a so-called censored equation (Hahn, Kuelbs and Weiner, 1990) instead of a likelihood equation. We first consider mean estimation to illustrate the main idea, and then move on to the regression problem. Due to space limitations, we leave some discussions and proofs of the theoretical results to Appendix E in the supplemental material.

3.2.1 Motivation: truncated mean

Let $X_{1},\ldots,X_{n}$ be IID random variables from $X$ with mean $\mu$ and variance $\sigma^{2}>0$ . Without loss of generality, we first assume $\mu=0$ . Catoni (2012) proved that the worst case deviations of the sample mean $\bar{X}_{n}$ are suboptimal with heavy-tailed data (see Appendix E.2). To attenuate the erratic fluctuations in $\bar{X}_{n}$ , it is natural to consider the truncated sample mean

[TABLE]

where

[TABLE]

and $\tau$ is a tuning parameter that balances between bias and robustness. To see this, let $\mu_{\tau}=\mathbb{E}(\widehat{m}_{\tau})$ be the truncated mean. By Markov’s inequality, the bias term can be controlled by

[TABLE]

The robustness of $\widehat{m}_{\tau}$ , on the other hand, can be characterized via the deviation

[TABLE]
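To make the truncation concrete, the following minimal sketch computes the truncated sample mean, assuming the truncation function $\psi_{\tau}(u)=\mathop{\mathrm{sign}}(u)\min(|u|,\tau)$ defined in Appendix A.1 and the estimator $\widehat{m}_{\tau}=(1/n)\sum_{i=1}^{n}\psi_{\tau}(X_{i})$ . The names `psi` and `truncated_mean` are illustrative.

```python
import numpy as np

def psi(x, tau):
    """Truncation function: sign(x) * min(|x|, tau)."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.minimum(np.abs(x), tau)

def truncated_mean(x, tau):
    """Average of the truncated observations psi_tau(X_1), ..., psi_tau(X_n)."""
    return float(np.mean(psi(x, tau)))
```

A small $\tau$ clips more observations (more robustness, more bias), whereas a very large $\tau$ recovers the ordinary sample mean.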

The following result shows that with a properly chosen $\tau$ , the truncated sample mean achieves a sub-Gaussian performance under the finite variance condition. Moreover, uniformity of the rate over a neighborhood of the optimal tuning scale requires an additional $\log(n)$ -factor. For every $\tau>0$ , define the truncated second moment

[TABLE]

Proposition 3.1.

For any $1\leq t<n\mathbb{P}(|X|>0)$ , let $\tau_{t}>0$ be the solution to

[TABLE]

(i)

With probability at least $1-2e^{-t}$ , $\widehat{m}_{\tau_{t}}$ satisfies

[TABLE]

(ii)

With probability at least $1-2e^{\log n-t}$ ,

[TABLE]

where $C_{t}:=\sup_{\sigma_{\tau_{t}}/2\leq c\leq 3\sigma_{\tau_{t}}/2}\{\sigma_{c(n/t)^{1/2}}\sqrt{2}-c^{-1}\sigma_{c(n/t)^{1/2}}^{2}+c/3+c^{-1}\sigma^{2}\}\leq\sqrt{2}\sigma+2\sigma^{2}/\sigma_{\tau_{t}}+\sigma_{\tau_{t}}/6$ .

The next result establishes existence and uniqueness of the solution to equation (3.8).

Proposition 3.2.

(i). Provided $0<t<n\mathbb{P}(|X|>0)$ , equation (3.8) has a unique solution, denoted by $\tau_{t}$ , which satisfies

[TABLE]

where $q_{\alpha}:=\inf\{z:\mathbb{P}(|X|>z)\leq\alpha\}$ is the upper $\alpha$ -quantile of $|X|$ . (ii). Let $t=t_{n}>0$ satisfy $t_{n}\to\infty$ and $t=o(n)$ . Then $\tau_{t}\to\infty$ , $\sigma_{\tau_{t}}\to\sigma$ and $\tau_{t}\sim\sigma\sqrt{n/t}$ as $n\to\infty$ .

According to Proposition 3.1, an ideal $\tau$ is such that the sample mean of truncated data $\psi_{\tau}(X_{1}),\ldots,\psi_{\tau}(X_{n})$ is tightly concentrated around the true mean. At the same time, it is reasonable to expect that the empirical second moment of $\psi_{\tau}(X_{i})$ ’s provides an adequate estimate of $\sigma_{\tau}^{2}=\mathbb{E}\{\psi_{\tau}^{2}(X)\}$ . Motivated by this observation, we propose to choose $\tau$ by solving the equation

[TABLE]

or equivalently, solving

[TABLE]

Equation (3.11) is the sample version of (3.8). Provided the solution exists and is unique, denoted by $\widehat{\tau}_{t}$ , we obtain a data-driven estimator

[TABLE]

As a direct consequence of Proposition 3.2, the following result ensures existence and uniqueness of the solution to equation (3.11).

Proposition 3.3.

Provided $0<t<\sum_{i=1}^{n}I(|X_{i}|>0)$ , equation (3.11) has a unique solution.

Throughout, we use $\widehat{\tau}_{t}$ to denote the solution to equation (3.11), which is unique and positive when $t<\sum_{i=1}^{n}I(|X_{i}|>0)$ . For completeness, we set $\widehat{\tau}_{t}=0$ if $t\geq\sum_{i=1}^{n}I(|X_{i}|>0)$ . If $\mathbb{P}(X=0)=0$ , then $\widehat{\tau}_{t}>0$ with probability one as long as $0<t<n$ . In the special case of $t=1$ , since $\psi_{\widehat{\tau}_{1}}(X_{i})=X_{i}$ for all $i=1,\ldots,n$ , equation (3.11) has a unique solution $\widehat{\tau}_{1}=(\sum_{i=1}^{n}X_{i}^{2})^{1/2}$ . With both $\tau_{t}$ and $\widehat{\tau}_{t}$ well defined, we next investigate the statistical properties of $\widehat{\tau}_{t}$ .
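The sample censored equation can be solved numerically by bisection. The sketch below assumes that (3.11) takes the form $\sum_{i=1}^{n}\min(X_{i}^{2},\tau^{2})/\tau^{2}=t$ , which is consistent with the special case $\widehat{\tau}_{1}=(\sum_{i=1}^{n}X_{i}^{2})^{1/2}$ noted above; the function name `solve_censored` is hypothetical. The left-hand side is non-increasing in $\tau$ , equals the number of nonzero observations for small $\tau$ , and is at most $t$ at $\tau=(\sum_{i}X_{i}^{2}/t)^{1/2}$ , which gives a valid bracket.

```python
import numpy as np

def solve_censored(x, t, n_iter=200):
    """Bisection for tau solving sum_i min(x_i^2, tau^2) / tau^2 = t.

    Assumed form of the sample censored equation. Returns 0.0 when
    t >= #{i : x_i != 0}, following the convention stated in the text.
    """
    if t <= 0:
        raise ValueError("t must be positive")
    x2 = np.asarray(x, dtype=float) ** 2
    nonzero = x2[x2 > 0]
    if t >= nonzero.size:
        return 0.0
    lo = np.sqrt(nonzero.min())      # left side equals the nonzero count, above t
    hi = np.sqrt(x2.sum() / t)       # left side is at most t
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if np.minimum(x2, mid**2).sum() / mid**2 > t:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For instance, with observations $3$ and $4$ and $t=1$ the routine returns a value numerically equal to $5=(3^{2}+4^{2})^{1/2}$ .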

Theorem 3.3.

Assume that $\textnormal{var}(X)<\infty$ and $\mathbb{P}(X=0)=0$ . For any $1\leq t<n$ and $0<r<1$ , we have

[TABLE]

where

[TABLE]

where $P(z)=\mathbb{E}\{X^{2}I(|X|\leq z)\}$ and $Q(z)=\mathbb{E}\{\psi^{2}_{z}(X)\}$ for $z\geq 0$ .

More properties of functions $P(z)$ and $Q(z)$ can be found in Appendix E.1 in the supplement.

Remark 6.

We discuss some direct implications of Theorem 3.3.

(i)

Let $t=t_{n}\geq 1$ satisfy $t\to\infty$ and $t=o(n)$ as $n\to\infty$ . By Proposition 3.2, $\tau_{t}\to\infty$ , $\sigma_{\tau_{t}}\to\sigma$ and $\tau_{t}\sim\sigma\sqrt{n/t}$ , which further implies $P(\tau_{t})\to\sigma^{2}$ and $Q(\tau_{t})\to\sigma^{2}$ as $n\to\infty$ .

(ii)

With $r=1/2$ and $t=(\log n)^{1+\kappa}$ for some $\kappa>0$ in (3.13), the constants $a_{1}=a_{1}(t,1/2)$ and $a_{2}=a_{2}(t,1/2)$ satisfy $a_{1}\to 5/9$ and $a_{2}\to 3/2$ as $n\to\infty$ . The resulting $\widehat{\tau}_{t}$ satisfies that with probability approaching one, $\tau_{t}/2\leq\widehat{\tau}_{t}\leq 3\tau_{t}/2$ .

The following result, which is a direct consequence of (3.10), Theorem 3.3 and Remark 6, shows that the data-driven estimator $\widehat{m}_{\widehat{\tau}_{t}}$ with $t=(\log n)^{1+\kappa}$ ( $\kappa>0$ ) is tightly concentrated around the mean with high probability.

Corollary 1.

Assume the conditions of Theorem 3.3 hold. Then, the truncated mean $\widehat{m}=\widehat{m}_{\widehat{\tau}_{t}}$ with $t=(\log n)^{1+\kappa}$ for some $\kappa>0$ satisfies $|\widehat{m}|\leq c_{1}\sqrt{(\log n)^{1+\kappa}/n}$ with probability greater than $1-c_{2}n^{-1}$ as $n\to\infty$ , where $c_{1},c_{2}>0$ are constants independent of $n$ .

3.2.2 Huber’s mean estimator

For the truncated sample mean, even with the theoretically optimal tuning parameter, the deviation of the estimator only scales with the second moment rather than the ideal scale $\sigma$ . Indeed, the truncation method described above primarily serves as a heuristic device and paves the way for developing data-driven Huber estimators.

Given IID samples $X_{1},\ldots,X_{n}$ with mean $\mu$ and variance $\sigma^{2}$ , recall the Huber estimator $\widehat{\mu}_{\tau}=\mathop{\mathrm{argmin}}_{\theta}\sum_{i=1}^{n}\ell_{\tau}(X_{i}-\theta)$ , which is also the unique solution to

[TABLE]
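For a fixed $\tau$ , the Huber mean estimator can be computed from its estimating equation. The sketch below assumes that equation reads $\sum_{i=1}^{n}\psi_{\tau}(X_{i}-\theta)=0$ , which follows from $\psi_{\tau}=\ell^{\prime}_{\tau}$ , and uses bisection because the left-hand side is non-increasing in $\theta$ . The name `huber_mean` is illustrative.

```python
import numpy as np

def huber_mean(x, tau, n_iter=200):
    """Solve sum_i psi_tau(x_i - theta) = 0 for theta by bisection.

    psi_tau(u) = sign(u) * min(|u|, tau) is implemented with np.clip.
    The score is >= 0 at theta = min(x) and <= 0 at theta = max(x).
    """
    x = np.asarray(x, dtype=float)
    lo, hi = float(x.min()), float(x.max())
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        score = np.clip(x - mid, -tau, tau).sum()
        if score > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With data $(0,0,0,0,100)$ and $\tau=1$ , the score equation reduces to $-4\theta+1=0$ , so the estimate is $0.25$ , whereas the sample mean is $20$ .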

The non-asymptotic property of $\widehat{\mu}_{\tau}$ is characterized by a Bahadur-type representation result developed in Zhou et al. (2018): for any $t>0$ , $\widehat{\mu}_{\tau}$ with $\tau=\sigma\sqrt{n/t}$ satisfies the bound $|\widehat{\mu}_{\tau}-\mu-(1/n)\sum_{i=1}^{n}\psi_{\tau}(\varepsilon_{i})|\leq c_{1}\sigma t/\sqrt{n}$ with probability at least $1-3e^{-t}$ provided $n\geq c_{2}t$ , where $c_{1},c_{2}>0$ are absolute constants and $\varepsilon_{i}=X_{i}-\mu$ are noise variables. In other words, a properly chosen $\tau$ is such that the truncated average $(1/n)\sum_{i=1}^{n}\psi_{\tau}(\varepsilon_{i})$ is resistant to outliers caused by a heavy-tailed ‘noise’. Similar to (3.11), now we would like to choose the robustification parameter by solving

[TABLE]

which is practically impossible as $\varepsilon_{i}$ ’s are unobserved realized noise. In light of (3.15) and (3.16), and motivated by Huber’s Proposal 2 [page 96 in Huber (1964)] for the simultaneous estimation of location and scale, we propose to estimate $\mu$ and calibrate $\tau$ by solving the following system of equations

[TABLE]

This method of simultaneous estimation can be naturally extended to the regression setting, as discussed in the next section.

A different yet comparable proposal is a two-step method, namely $M$ -estimation of $\mu$ with an auxiliary robustification parameter computed separately by solving

[TABLE]

It is, however, less clear how this method can be generalized to the regression problem. Therefore, our focus will be on the previous approach.

3.2.3 Data-driven Huber regression

Consider the linear model $Y_{i}=\bm{X}_{i}^{\intercal}\bm{\theta}^{*}+\varepsilon_{i}$ as in (1.1) and the Huber estimator $\widehat{\bm{\theta}}_{\tau}=\mathop{\mathrm{argmin}}_{\bm{\theta}\in\mathbb{R}^{d}}\mathcal{L}_{\tau}(\bm{\theta})$ , where $\mathcal{L}_{\tau}(\bm{\theta})=\sum_{i=1}^{n}\ell_{\tau}(Y_{i}-\bm{X}_{i}^{\intercal}\bm{\theta})$ . From the deviation analysis in (2.1) we see that to achieve the sub-Gaussian performance bound, the theoretically desirable tuning parameter for $\widehat{\bm{\theta}}_{\tau}$ is $\tau\sim\sigma\sqrt{n/(d+t)}$ with $\sigma^{2}=\textnormal{var}(\varepsilon_{i})$ . Further, by the Bahadur representation (2.7),

[TABLE]

where the remainder $\mathcal{R}_{\tau}$ is of the order $\sigma(d+t)/n$ with exponentially high probability. This result demonstrates that the robustness is essentially gained from truncating the errors. Motivated by this representation and our discussions in Section 3.2.1, a robust tuning scheme is to find $\tau$ such that

[TABLE]

Unlike the mean estimation problem, the realized noises $\varepsilon_{i}$ are unobserved. It is therefore natural to calibrate $\tau$ using fitted residuals. On the other hand, for a given $\tau>0$ , the Huber loss minimization is equivalent to the following least squares problem with variable weights:

[TABLE]

where the minimization is over $w_{i}\geq 0$ and $\bm{\theta}\in\mathbb{R}^{d}$ . This equivalence can be derived by writing down the KKT conditions of (3.18). Details will be provided in Remark 7 below. By (3.18), this problem can be solved via the iteratively reweighted least squares method.

To summarize, we propose an iteratively reweighted least squares algorithm, which starts at iteration 0 with an initial estimate $\bm{\theta}^{(0)}=\widehat{\bm{\theta}}_{{\rm ols}}$ (the least squares estimator) and involves three steps at each iteration.

Calibration: Using the current estimate $\bm{\theta}^{(k)}$ , we compute the vector of residuals $\bm{R}^{(k)}=(R_{1}^{(k)},\ldots,R_{n}^{(k)})^{\intercal}$ , where $R_{i}^{(k)}=Y_{i}-\bm{X}_{i}^{\intercal}\bm{\theta}^{(k)}$ . Then we take $\tau^{(k)}$ as the solution to

[TABLE]

By Proposition 3.3, this equation has a unique positive solution provided $d+t<\sum_{i=1}^{n}I(|R_{i}^{(k)}|>0)$ .

Weighting: Compute the vector of weights ${\bm{w}}^{(k)}=(w_{1}^{(k)},\ldots,w_{n}^{(k)})^{\intercal}$ , where $w_{i}^{(k)}=|R_{i}^{(k)}|/\tau^{(k)}-1$ if $|R_{i}^{(k)}|>\tau^{(k)}$ and $w_{i}^{(k)}=0$ if $|R_{i}^{(k)}|\leq\tau^{(k)}$ . Then define the diagonal matrix $\textbf{W}^{(k)}={\rm diag}((1+w_{1}^{(k)})^{-1},\ldots,(1+w_{n}^{(k)})^{-1})$ .

Weighted least squares: Solve the weighted least squares problem (3.18) with $w_{i}=w_{i}^{(k)}$ and $\tau=\tau^{(k)}$ to obtain

[TABLE]

where $\textbf{X}=(\bm{X}_{1},\ldots,\bm{X}_{n})^{\intercal}\in\mathbb{R}^{n\times d}$ and ${\bm{Y}}=(Y_{1},\ldots,Y_{n})^{\intercal}$ .

Repeat the above three steps until convergence or until the maximum number of iterations is reached.
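The three steps above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes that the calibration equation (3.19) reads $\sum_{i=1}^{n}\min\{(R_{i}^{(k)})^{2},\tau^{2}\}/\tau^{2}=d+t$ and that the weighted least squares update is $\bm{\theta}^{(k+1)}=(\textbf{X}^{\intercal}\textbf{W}^{(k)}\textbf{X})^{-1}\textbf{X}^{\intercal}\textbf{W}^{(k)}{\bm{Y}}$ . The names `calibrate_tau` and `adaptive_huber`, the default $t=\log n$ , and the stopping tolerance are assumptions made for the example.

```python
import numpy as np

def calibrate_tau(resid, target, n_iter=200):
    """Bisection for tau solving sum_i min(r_i^2, tau^2) / tau^2 = target."""
    r2 = np.asarray(resid, dtype=float) ** 2
    nonzero = r2[r2 > 0]
    if not 0 < target < nonzero.size:
        raise ValueError("need 0 < target < number of nonzero residuals")
    lo = np.sqrt(nonzero.min())       # left side equals the nonzero count here
    hi = np.sqrt(r2.sum() / target)   # left side is at most target here
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if np.minimum(r2, mid**2).sum() / mid**2 > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def adaptive_huber(X, Y, t=None, max_iter=100, tol=1e-8):
    """Iteratively reweighted least squares with tau recalibrated each round."""
    n, d = X.shape
    t = np.log(n) if t is None else t
    theta = np.linalg.lstsq(X, Y, rcond=None)[0]       # iteration 0: least squares
    tau = None
    for _ in range(max_iter):
        resid = Y - X @ theta
        tau = calibrate_tau(resid, d + t)              # calibration step
        # Diagonal of W: 1 / (1 + w_i) = min(1, tau / |R_i|).
        w = tau / np.maximum(np.abs(resid), tau)       # weighting step
        XtW = X.T * w
        theta_new = np.linalg.solve(XtW @ X, XtW @ Y)  # weighted least squares step
        converged = np.linalg.norm(theta_new - theta) <= tol
        theta = theta_new
        if converged:
            break
    return theta, tau
```

Observations whose residuals exceed $\tau^{(k)}$ in magnitude receive weight $\tau^{(k)}/|R_{i}^{(k)}|<1$ , so gross outliers are progressively downweighted while the remaining observations keep unit weight.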

In addition, from Theorems 2.3–2.5 we see that the validity of the multiplier bootstrap procedure requires a finite fourth moment condition, under which the ideal choice of $\tau$ is $\{\upsilon_{4}n/(d+t)\}^{1/4}$ . To construct a data-dependent robust bootstrap confidence set, we adjust equation (3.19) by replacing $(R_{i}^{(k)})^{2}$ and $\tau^{2}$ therein with $(R_{i}^{(k)})^{4}$ and $\tau^{4}$ , and solve instead

[TABLE]

Keep the other two steps and repeat until convergence or the maximum number of iterations is reached. Let $\widehat{\bm{\theta}}_{\widehat{\tau}}$ and $\widehat{\tau}$ be the obtained solutions. Then, we apply Algorithm 1 with $\tau=\widehat{\tau}$ to construct confidence sets.
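The fourth-moment calibration can be sketched in the same way. The block below assumes that the adjusted equation reads $\sum_{i=1}^{n}\min\{(R_{i}^{(k)})^{4},\tau^{4}\}/\tau^{4}=d+t$ , obtained from the description above by replacing squares with fourth powers; the function name `calibrate_tau_power` and its `power` argument are hypothetical.

```python
import numpy as np

def calibrate_tau_power(resid, target, power=4, n_iter=200):
    """Bisection for tau solving sum_i min(|r_i|^p, tau^p) / tau^p = target."""
    a = np.abs(np.asarray(resid, dtype=float)) ** power
    nonzero = a[a > 0]
    if not 0 < target < nonzero.size:
        raise ValueError("need 0 < target < number of nonzero residuals")
    lo = nonzero.min() ** (1.0 / power)        # left side equals the nonzero count
    hi = (a.sum() / target) ** (1.0 / power)   # left side is at most target
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if np.minimum(a, mid**power).sum() / mid**power > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

When no residual is truncated, the solution equals $\{\sum_{i}R_{i}^{4}/(d+t)\}^{1/4}$ , which is the sample analogue of the ideal choice $\{\upsilon_{4}n/(d+t)\}^{1/4}$ .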

Finally we discuss the choice of $t$ . Since $t$ appears in both the deviation bound and confidence level, we let $t=t_{n}$ slowly grow with the sample size to gain robustness without compromising unbiasedness. We take $t=\log n$ , a typical slowly growing function of $n$ , in all the numerical experiments carried out in this paper.

Remark 7 (Equivalence between (3.18) and Huber regression).

For a given $\bm{\theta}$ in (3.18), define $R_{i}=Y_{i}-\bm{X}_{i}^{\intercal}\bm{\theta}$ , $i=1,\ldots,n$ . The KKT condition of the program (3.18) with respect to each $w_{i}$ under the constraint $w_{i}\geq 0$ now reads:

[TABLE]

where $\lambda_{i}$ is the Lagrangian multiplier. The solution to the KKT condition takes the form:

[TABLE]

This gives the optimal solution of $w_{i}$ . By plugging the optimal solution of $w_{i}$ back into (3.18), we obtain the following optimization with respect to $\bm{\theta}$ :

[TABLE]

which is equivalent to Huber regression.
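The equivalence in Remark 7 can also be checked numerically. The sketch below assumes that each summand of (3.18) has the form $R_{i}^{2}/(1+w_{i})+\tau^{2}w_{i}$ , an assumption chosen to be consistent with the weights $w_{i}=|R_{i}|/\tau-1$ used in the weighting step, and verifies that minimizing over $w_{i}\geq 0$ returns twice the Huber loss $\ell_{\tau}(R_{i})$ . All function names are illustrative.

```python
import numpy as np

def huber_loss(r, tau):
    """Huber loss: r^2 / 2 if |r| <= tau, and tau * |r| - tau^2 / 2 otherwise."""
    r = np.abs(np.asarray(r, dtype=float))
    return np.where(r <= tau, 0.5 * r**2, tau * r - 0.5 * tau**2)

def optimal_weight(r, tau):
    """Minimizer over w >= 0 of the assumed summand: max(|r| / tau - 1, 0)."""
    return np.maximum(np.abs(np.asarray(r, dtype=float)) / tau - 1.0, 0.0)

def weighted_objective(r, w, tau):
    """Assumed per-observation objective r^2 / (1 + w) + tau^2 * w."""
    r = np.asarray(r, dtype=float)
    return r**2 / (1.0 + w) + tau**2 * w
```

For example, with $R_{i}=10$ and $\tau=1.5$ the optimal weight is $w_{i}=17/3$ and the minimized summand is $15+12.75=27.75=2\ell_{\tau}(10)$ .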

4 Multiple inference with multiplier bootstrap calibration

In this section, we apply the adaptive Huber regression with multiplier bootstrap to simultaneously test the hypotheses in (1.3). Given a random sample $(\bm{y}_{1},\bm{x}_{1}),\ldots,(\bm{y}_{n},\bm{x}_{n})$ from the multiple response regression model (1.2), we define robust estimators

[TABLE]

where $\tau_{k}$ ’s are robustification parameters.

To conduct simultaneous inference for $\mu_{k}$ ’s, we use the multiplier bootstrap to approximate the distribution of $\widehat{\mu}_{k}-\mu_{k}$ . Let $W$ be a random variable with unit mean and variance. Independent of $\{(\bm{y}_{i},\bm{x}_{i})\}_{i=1}^{n}$ , let $\{W_{ik},1\leq i\leq n,1\leq k\leq m\}$ be IID from $W$ . Define the multiplier bootstrap estimators

[TABLE]

where $\widehat{\bm{\theta}}_{k}=(\widehat{\mu}_{k},\widehat{\bm{\beta}}_{k}^{\intercal})^{\intercal}$ and $R_{k}$ ’s are radius parameters. We will show that the unknown distribution of $\sqrt{n}\,(\widehat{\mu}_{k}-\mu_{k})$ can be approximated by the conditional distribution of $\sqrt{n}\,(\widehat{\mu}_{k}^{\mathrm{\tiny{b}}}-\widehat{\mu}_{k})$ given $\mathcal{D}_{kn}:=\{(y_{ik},\bm{x}_{i})\}_{i=1}^{n}$ .

The main result of this section establishes validity of the multiplier bootstrap on controlling the FDP in multiple testing. For $k=1,\ldots,m$ , define test statistics $\widehat{T}_{k}=\sqrt{n}\,\widehat{\mu}_{k}$ and the corresponding bootstrap $p$ -values $p^{\mathrm{\tiny{b}}}_{k}=G^{\mathrm{\tiny{b}}}_{k}(|\widehat{T}_{k}|)$ , where $G^{\mathrm{\tiny{b}}}_{k}(z):=\mathbb{P}(\sqrt{n}\,|\widehat{\mu}_{k}^{\mathrm{\tiny{b}}}-\widehat{\mu}_{k}|\geq z|\mathcal{D}_{kn})$ , $z\geq 0$ . For any given threshold $t\in(0,1)$ , the false discovery proportion is defined as

[TABLE]

where $V(t)=\sum_{k\in\mathcal{H}_{0}}I(p^{\mathrm{\tiny{b}}}_{k}\leq t)$ is the number of false discoveries, $R(t)=\sum_{k=1}^{m}I(p^{\mathrm{\tiny{b}}}_{k}\leq t)$ is the number of total discoveries and $\mathcal{H}_{0}:=\{k:1\leq k\leq m,\mu_{k}=0\}$ is the set of true null hypotheses. For any prespecified $\alpha\in(0,1)$ , applying the Benjamini and Hochberg (BH) method (Benjamini and Hochberg, 1995) to the bootstrap $p$ -values $p^{\mathrm{\tiny{b}}}_{1},\ldots,p^{\mathrm{\tiny{b}}}_{m}$ induces a data-dependent threshold

[TABLE]

We reject the null hypotheses for which $p^{\mathrm{\tiny{b}}}_{k}\leq t^{\mathrm{\tiny{b}}}_{{\rm BH}}$ .

Condition 3.

$(\bm{y}_{1},\bm{x}_{1}),\ldots,(\bm{y}_{n},\bm{x}_{n})$ are IID observations from $(\bm{y},\bm{x})$ that satisfies $\bm{y}=\bm{\mu}+\bm{\Gamma}\bm{x}+\bm{\epsilon}$ , where $\bm{y}=(y_{1},\ldots,y_{m})^{\intercal}$ , $\bm{\mu}=(\mu_{1},\ldots,\mu_{m})^{\intercal}$ , $\bm{\Gamma}=(\bm{\beta}_{1},\ldots,\bm{\beta}_{m})^{\intercal}\in\mathbb{R}^{m\times s}$ and $\bm{\epsilon}=(\varepsilon_{1},\ldots,\varepsilon_{m})^{\intercal}$ . The random vector $\bm{x}\in\mathbb{R}^{s}$ satisfies $\mathbb{E}(\bm{x})=\textbf{0}$ , $\mathbb{E}(\bm{x}\bm{x}^{\intercal})=\bm{\Sigma}$ and $\mathbb{P}(|\langle\bm{u},\bm{\Sigma}^{-1/2}\bm{x}\rangle|\geq t)\leq 2\exp(-t^{2}\|\bm{u}\|_{2}^{2}/A_{0}^{2})$ for all $\bm{u}\in\mathbb{R}^{s}$ , $t\in\mathbb{R}$ and some constant $A_{0}>0$ . Independent of $\bm{x}$ , the noise vector $\bm{\epsilon}$ has independent elements and satisfies $\mathbb{E}(\bm{\epsilon})=\textbf{0}$ and $c_{l}\leq\min_{1\leq k\leq m}\sigma_{k}\leq\max_{1\leq k\leq m}\upsilon_{k,4}^{1/4}\leq c_{u}$ for some constants $c_{l},c_{u}>0$ , where $\sigma_{k}^{2}=\mathbb{E}(\varepsilon_{k}^{2})$ and $\upsilon_{k,4}=\mathbb{E}(\varepsilon_{k}^{4})$ .

Theorem 4.1.

Assume Condition 3 holds and $m=m(n)$ satisfies $m\to\infty$ and $\log m=o(n^{1/3})$ . Moreover, as $(n,m)\to\infty$ ,

[TABLE]

for some $\lambda_{0}>2$ . Then, with

[TABLE]

in (4.1) and (4.2), it holds

[TABLE]

where $m_{0}={\rm card}(\mathcal{H}_{0})$ .

In practice, conditional quantiles of $\sqrt{n}\,(\widehat{\mu}_{k}^{\mathrm{\tiny{b}}}-\widehat{\mu}_{k})$ can be approximated to arbitrary precision by Monte Carlo simulation: independent of the observed data, generate IID random weights $\{W_{ik,b},1\leq i\leq n,1\leq k\leq m,1\leq b\leq B\}$ from $W$ , where $B$ is the number of bootstrap replications. For each $k$ , the bootstrap samples of $(\widehat{\mu}_{k}^{\mathrm{\tiny{b}}},\widehat{\bm{\beta}}^{\mathrm{\tiny{b}}}_{k})$ are given by

[TABLE]

For $k=1,\ldots,m$ , define empirical tail distributions

[TABLE]

The bootstrap $p$ -values are thus given by $\{p^{\mathrm{\tiny{b}}}_{k,B}=G^{\mathrm{\tiny{b}}}_{k,B}(\sqrt{n}\,|\widehat{\mu}_{k}|)\}_{k=1}^{m}$ , to which either the BH procedure or Storey’s procedure can be applied. For the former, we reject $H_{0k}$ if and only if $p^{\mathrm{\tiny{b}}}_{k,B}\leq p^{\mathrm{\tiny{b}}}_{(k_{B}^{\mathrm{\tiny{b}}}),B}$ , where $k_{B}^{\mathrm{\tiny{b}}}=\max\{k:1\leq k\leq m,p^{\mathrm{\tiny{b}}}_{(k),B}\leq k\alpha/m\}$ for a predetermined $0<\alpha<1$ and $p^{\mathrm{\tiny{b}}}_{(1),B}\leq\cdots\leq p^{\mathrm{\tiny{b}}}_{(m),B}$ are the ordered bootstrap $p$ -values. See Algorithm 2 for detailed implementations.
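The BH rejection rule described above can be sketched in a few lines. The function takes a vector of $p$ -values (for instance the bootstrap $p$ -values $p^{\mathrm{\tiny{b}}}_{k,B}$ ) and the level $\alpha$ , finds the largest $k$ with $p_{(k)}\leq k\alpha/m$ , and rejects the hypotheses with the $k$ smallest $p$ -values. The name `benjamini_hochberg` is illustrative, and the computation of the bootstrap $p$ -values themselves is not shown.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha):
    """Return a boolean rejection mask from the BH step-up rule."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                                  # indices of sorted p-values
    below = p[order] <= alpha * np.arange(1, m + 1) / m    # p_(k) <= k * alpha / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k_star = int(np.max(np.nonzero(below)[0]))         # zero-based largest such k
        reject[order[: k_star + 1]] = True
    return reject
```

If no ordered $p$ -value falls below its threshold, nothing is rejected.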

5 Numerical studies

5.1 Confidence sets

We first provide simulation studies to illustrate the performance of the robust bootstrap procedure for constructing confidence sets with various heavy-tailed errors. Recall the linear model $Y_{i}=\bm{X}_{i}^{\intercal}\bm{\theta}^{*}+\varepsilon_{i}$ in (1.1). We simulate $\{\bm{X}_{i}\}_{i=1}^{n}$ from $\mathcal{N}(0,\bm{I}_{d})$ . The true regression coefficient $\bm{\theta}^{*}$ is a vector whose entries are equally spaced on $[0,1]$ . The errors $\varepsilon_{i}$ are IID from one of the following distributions, standardized to have mean 0 and variance 1.

1. Standard Gaussian distribution $\mathcal{N}(0,1)$ ;

2. $t_{\nu}$ -distribution with degrees of freedom $\nu=3.5$ ;

3. Gamma distribution with shape parameter $3$ and scale parameter $1$ ;

4. $t$ -Weibull mixture (Wbl mix) model: $\varepsilon=0.5u_{t}+0.5u_{\mathrm{W}}$ , where $u_{t}$ follows a standardized $t_{4}$ -distribution and $u_{\rm W}$ follows a standardized Weibull distribution with shape parameter $0.75$ and scale parameter $0.75$ ;

5. Pareto-Gaussian mixture (Par mix) model: $\varepsilon=0.5u_{\mathrm{P}}+0.5u_{\mathrm{G}}$ , where $u_{\mathrm{P}}$ follows a Pareto distribution with shape parameter $4$ and scale parameter $1$ and $u_{{\rm G}}\sim\mathcal{N}(0,1)$ ;

6. Lognormal-Gaussian mixture (Logn mix) model: $\varepsilon=0.5u_{\mathrm{LN}}+0.5u_{\mathrm{G}}$ , where $u_{\mathrm{LN}}=\exp(1.25Z)$ with $Z\sim\mathcal{N}(0,1)$ and $u_{{\rm G}}\sim\mathcal{N}(0,1)$ .
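As an illustration of the standardization, the sketch below generates errors from the first three distributions in the list and rescales them to mean 0 and variance 1, using $\mathrm{var}(t_{\nu})=\nu/(\nu-2)$ and the Gamma mean and variance (shape times scale, and shape times scale squared). The mixture models are omitted, and the function name `standardized_errors` and its `kind` labels are assumptions made for the example.

```python
import numpy as np

def standardized_errors(kind, size, rng):
    """Draw errors with mean 0 and variance 1 from one of three distributions."""
    if kind == "gaussian":
        return rng.standard_normal(size)
    if kind == "t":
        nu = 3.5
        return rng.standard_t(nu, size) / np.sqrt(nu / (nu - 2.0))
    if kind == "gamma":
        shape, scale = 3.0, 1.0
        draws = rng.gamma(shape, scale, size)
        return (draws - shape * scale) / (np.sqrt(shape) * scale)
    raise ValueError("kind must be 'gaussian', 't' or 'gamma'")
```

Since the $t_{3.5}$ distribution has no finite fourth moment, its sample variance converges slowly even though the population variance equals one after rescaling.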

Moreover, we consider three types of random weights as follows:

•

Gaussian weights: $W_{i}\sim\mathcal{N}(0,1)+1$ ;

•

Bernoulli weights (with success probability 0.5): $W_{i}\sim 2\mathrm{Ber}(0.5)$ ;

•

A mixture of Bernoulli and Gaussian weights considered by Zhilova (2016): $W_{i}=z_{i}+u_{i}+1$ , with $u_{i}\sim(\mathrm{Ber}(b)-b)\sigma_{u}$ , $b=0.276$ , $\sigma_{u}=0.235$ , and $z_{i}\sim\mathcal{N}(0,\sigma_{z}^{2})$ , $\sigma_{z}^{2}=0.038$ .

All three weights considered are such that $\mathbb{E}(W_{i})=\textnormal{var}(W_{i})=1$ . Using non-negative random weights has the advantage that the weighted objective function is guaranteed to be convex. Numerical results reveal that Gaussian and Bernoulli weights demonstrate almost the same coverage performance.
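The first two weight schemes are easy to generate; the sketch below draws Gaussian weights $\mathcal{N}(0,1)+1$ and Bernoulli weights $2\mathrm{Ber}(0.5)$ , both of which have unit mean and unit variance. The mixture weights are not included, and the function names are illustrative.

```python
import numpy as np

def gaussian_weights(n, rng):
    """W_i = 1 + N(0, 1): mean 1, variance 1, may take negative values."""
    return 1.0 + rng.standard_normal(n)

def bernoulli_weights(n, rng):
    """W_i = 2 * Ber(0.5): values in {0, 2}, mean 1, variance 1, non-negative."""
    return 2.0 * rng.integers(0, 2, size=n)
```

Only the Bernoulli weights are non-negative, so only they guarantee convexity of the weighted objective in every bootstrap draw.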

The number of bootstrap replications is set to be $B=2000$ . Nominal coverage probabilities $1-\alpha$ are given in the columns, where we consider $1-\alpha\in\{0.95,0.90,0.85,0.80,0.75\}$ . We report the empirical coverage probabilities from $1000$ simulations. We first consider a simple approach for choosing $\tau$ , which is set to be $1.2\{\widehat{\nu}_{4}n/(d+\log n)\}^{1/4}$ . Here, $\widehat{\nu}_{4}$ is the empirical fourth moment of the residuals from the OLS and the constant 1.2 (which is slightly larger than 1) is chosen in accordance with Theorem 2.5 which requires $v\geq\upsilon_{4}^{1/4}$ . This simple ad hoc approach leads to adequate results in most cases. In Section 5.2, we further investigate the empirical performance of the fully data-dependent procedure proposed in Section 3.
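The simple choice of $\tau$ described above can be computed directly from the OLS residuals. The following sketch implements $1.2\{\widehat{\nu}_{4}n/(d+\log n)\}^{1/4}$ with $\widehat{\nu}_{4}$ taken as the average of the fourth powers of the residuals; the function name `simple_tau` and the argument `c` are illustrative.

```python
import numpy as np

def simple_tau(X, Y, c=1.2):
    """Return c * (nu4_hat * n / (d + log n))**(1/4) from OLS residuals."""
    n, d = X.shape
    theta_ols = np.linalg.lstsq(X, Y, rcond=None)[0]
    resid = Y - X @ theta_ols
    nu4_hat = np.mean(resid**4)          # empirical fourth moment of residuals
    return c * (nu4_hat * n / (d + np.log(n))) ** 0.25
```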

We compare our method with an OLS-based bootstrap procedure studied in Spokoiny and Zhilova (2015), namely, replacing the weighted Huber loss in (2.16) by the weighted quadratic loss $\mathcal{L}_{\mathrm{ols}}^{\mathrm{\tiny{b}}}(\bm{\theta})=\sum_{i=1}^{n}W_{i}(Y_{i}-\bm{X}_{i}^{\intercal}\bm{\theta})^{2}$ .
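For the quadratic loss the bootstrap minimizer has a closed form, so the boot-OLS comparator is simple to sketch. The block below is a minimal illustration under stated assumptions: the critical value is taken to be the empirical $(1-\alpha)$ -quantile, over $B$ draws, of $\mathcal{L}_{\mathrm{ols}}^{\mathrm{\tiny{b}}}(\widehat{\bm{\theta}})-\mathcal{L}_{\mathrm{ols}}^{\mathrm{\tiny{b}}}(\widehat{\bm{\theta}}^{\mathrm{\tiny{b}}})$ , and the confidence set collects all $\bm{\theta}$ whose unweighted excess loss does not exceed it. Bernoulli weights are used so that every weighted problem is convex. The function names and defaults are hypothetical and do not reproduce Algorithm 1.

```python
import numpy as np

def boot_ols_quantile(X, Y, alpha, B=500, rng=None):
    """Return the OLS estimate and a bootstrap critical value for its loss."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    theta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
    resid2 = (Y - X @ theta_hat) ** 2
    excess = np.empty(B)
    for b in range(B):
        W = 2.0 * rng.integers(0, 2, size=n)             # weights 2 * Ber(0.5)
        XtW = X.T * W
        theta_b = np.linalg.solve(XtW @ X, XtW @ Y)      # weighted least squares
        # Weighted loss at theta_hat minus weighted loss at its minimizer.
        excess[b] = W @ resid2 - W @ (Y - X @ theta_b) ** 2
    return theta_hat, float(np.quantile(excess, 1.0 - alpha))

def in_confidence_set(theta, X, Y, theta_hat, z):
    """Check whether the excess quadratic loss of theta is at most z."""
    loss = lambda th: float(np.sum((Y - X @ th) ** 2))
    return loss(theta) - loss(theta_hat) <= z
```

Replacing the quadratic loss by the Huber loss with a calibrated $\tau$ would require an iterative solver for each bootstrap draw, which is omitted here.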

Consider the sample size $n=100$ and dimension $d=5$ . Table 1 and Table 2 display the coverage probabilities of the proposed bootstrap Huber method (boot-Huber) and the bootstrap OLS method (boot-OLS). At the normal model, our approach achieves a similar performance as the boot-OLS, which demonstrates the efficiency of adaptive Huber regression. For heavy-tailed errors, our method significantly outperforms the boot-OLS using all three types of random weights. Also, we observe that the Gaussian and Bernoulli weights demonstrate nearly the same desirable performance. For simplicity, we focus on the Gaussian weights throughout the remaining simulation studies.

In Table 3, we increase the sample size to $n=200$ and retain all the other settings. For most cases of heavy-tailed errors, the coverage probability of the boot-OLS method is lower than the nominal level, sometimes to a large extent. In Table 4, we generate errors from a $t$ -Weibull mixture distribution and consider different combinations of $n$ ( $n\in\{50,100,200\}$ ) and $d$ ( $d\in\{2,5,10\}$ ). The robust procedure outperforms the least squares method across most of the settings. Similar phenomena are also observed in other cases of heavy-tailed errors.

We also report the standard deviations of the estimated quantiles of boot-Huber and boot-OLS; see Appendix F.1 in the supplement. The experimental results show that the boot-Huber leads to uniformly smaller standard deviations. Furthermore, we consider more challenging settings with correlated or non-Gaussian designs and non-equally spaced $\bm{\theta}^{*}$ . The average coverage probabilities of the boot-Huber method are in general close to nominal level, while the boot-OLS leads to severe under-coverage in many heavy-tailed noise settings. More details are presented in Appendix F.2 in the supplementary material.

5.2 Performance of the data-driven tuning approach

We further investigate the empirical performance of the data-driven procedure proposed in Section 3. We consider lognormal distributions $\mathrm{Logn}(\mu,\sigma)$ with location parameter $\mu=0$ and varying shape parameters $\sigma$ . The larger the value of $\sigma$ is, the heavier the tail is. Moreover, we take $n=200$ , $d=5$ and $1-\alpha\in[0.85,0.99]$ and compare three methods: (1) Huber-based bootstrap procedure with $\tau$ calibrated by solving (3.20) (adaptive boot-Huber), (2) Huber-based bootstrap procedure with $\tau=1.2\{\widehat{\nu}_{4}n/(d+\log n)\}^{1/4}$ (boot-Huber), and (3) OLS-based bootstrap method (boot-OLS).

From Figure 1 and Table 5 we see that, under lognormal models, the coverage probabilities of the adaptive boot-Huber method are closest to nominal levels, while the boot-OLS suffers from distorted empirical coverage: it tends to overestimate the real quantiles at high levels and severely underestimate the real quantiles at relatively lower levels. In addition, Figure 1(a) shows that the proposed Huber-based procedure loses almost no efficiency under a normal model.

6 Discussion

In this paper, we have proposed and analyzed robust inference methods for linear models with heavy-tailed errors. Specifically, we use a multiplier bootstrap procedure for constructing sharp confidence sets for adaptive Huber estimators and conducting large-scale simultaneous inference with heavy-tailed panel data. Our theoretical results provide explicit bounds for the bootstrap approximation errors and justify the bootstrap validity; the error of coverage probability is small as long as $d^{3}/n$ is small. For multiple testing, we show that when the error distributions have finite 4th moments and the dimension $m$ and sample size $n$ satisfy $\log m=o(n^{1/3})$ , the bootstrap Huber procedure asymptotically controls the overall false discovery proportion at the nominal level.

Furthermore, the proposed robust inference method can be potentially applied to a broad range of statistical problems, including high dimensional sparse regression, reduced rank regression, covariance matrix estimation and low-rank matrix recovery. We leave such an extension for further research.

Supplementary Material

This supplemental material contains (1) the proofs of Theorems 2.1–2.6 and Theorem 3.1 in the main text, (2) implementation of the proposed methods, and (3) additional simulation studies.

Appendix A Notations and Preliminaries

A.1 Notations

Recall that the error variable $\varepsilon$ has mean zero and variance $\sigma^{2}=\mathbb{E}(\varepsilon^{2})>0$ . For every $\tau>0$ , we define the truncated mean and second moment of $\varepsilon$ to be

[TABLE]

where $\psi_{\tau}(u):=\ell^{\prime}_{\tau}(u)=\mathop{\mathrm{sign}}(u)\min(|u|,\tau)$ , $u\in\mathbb{R}$ . For IID random variables $\varepsilon_{1},\ldots,\varepsilon_{n}$ from $\varepsilon$ , we define truncated variables

[TABLE]

The dependence of $\xi_{i}$ on $\tau$ is suppressed in the notation.

Moreover, define the $d\times d$ random matrix

[TABLE]

and the random variable

[TABLE]

Throughout, we use $\mathbb{P}^{\dagger}$ -probability to denote the probability measure over $\{(Y_{i},\bm{X}_{i})\}_{i=1}^{n}$ and use $\mathbb{P}^{*}$ -probability to denote the probability measure over $\{U_{i}\}_{i=1}^{n}$ conditioning on $\{(Y_{i},\bm{X}_{i})\}_{i=1}^{n}$ . In general, $\mathbb{P}$ denotes the probability measure over all the random variables involved.

A.2 Technical lemmas

In this section, we provide several technical lemmas that will be used repeatedly to prove the main theorems. Recall the isotropic random vectors $\bm{Z}_{i}$ given in (2.5). The first two lemmas provide concentration properties for $M_{n,4}$ and $\bm{S}_{n}$ , respectively.

Lemma A.1.

Assume Condition 1 holds. Then for any $x>0$ ,

[TABLE]

with probability at least $1-2e^{-x}$ , where $C>0$ depends only on $A_{0}$ .

Proof.

The proof is based on the covering argument. For any $\epsilon\in(0,1)$ , we can find an $\epsilon$ -net $\mathcal{N}_{\epsilon}$ of the unit sphere $\mathbb{S}^{d-1}$ satisfying ${\rm card}(\mathcal{N}_{\epsilon})\leq(1+2/\epsilon)^{d}$ . For every $\bm{u}\in\mathbb{S}^{d-1}$ , there exists some $\bm{v}\in\mathcal{N}_{\epsilon}$ such that $\|\bm{u}-\bm{v}\|_{2}\leq\epsilon$ . Define the map $f:\mathbb{R}^{d}\to\mathbb{R}^{n}$ as

[TABLE]

By the triangle inequality,

[TABLE]

Taking the maximum over $\bm{v}\in\mathcal{N}_{\epsilon}$ and then taking the supremum over $\bm{u}\in\mathbb{S}^{d-1}$ , we arrive at

[TABLE]

where $N_{n,\epsilon}=\max_{\bm{v}\in\mathcal{N}_{\epsilon}}(1/n)\sum_{i=1}^{n}(\bm{v}^{\intercal}\bm{Z}_{i})^{4}$ . Solving this inequality yields

[TABLE]

For every $\bm{v}\in\mathcal{N}_{\epsilon}$ , note that $\mathbb{P}\{(\bm{v}^{\intercal}\bm{Z}_{i})^{4}\geq y\}\leq 2e^{-\sqrt{y}/A_{0}^{2}}$ for any $y>0$ . Hence, by inequality (3.6) in Adamczak et al. (2011) with $s=1/2$ , we obtain that for any $z>0$ ,

[TABLE]

where $c>0$ is a universal constant and $C_{1}>0$ depends only on $A_{0}$ . Taking the union bound over all vectors $\bm{v}$ in $\mathcal{N}_{\epsilon}$ gives

[TABLE]

It follows that

[TABLE]

with probability at least $1-2e^{-x}$ , where $C_{2}>0$ depends only on $A_{0}$ .

Finally, taking $\epsilon=1/8$ in (A.6) and (A.7) implies (A.5). ∎

Lemma A.2.

Assume Condition 1 holds with $\delta=2$ . For any $x>0$ ,

[TABLE]

with probability at least $1-2e^{-x}$ .

Proof.

Define random variables $w_{i}=\xi_{i}/\sigma_{\tau}$ so that $\mathbb{E}(w_{i}^{2})=1$ . We will bound $\|\bm{\Delta}\|_{2}$ via a standard covering argument, where

[TABLE]

Proceeding as in the proof of Lemma 4.4.1 in Vershynin (2018), it can be shown that there exists a $1/4$ -net $\mathcal{N}_{1/4}$ of the unit sphere $\mathbb{S}^{d-1}$ satisfying $|\mathcal{N}_{1/4}|\leq 9^{d}$ such that

[TABLE]

For any $\bm{u}\in\mathbb{S}^{d-1}$ , by (B.7) we have $\mathbb{E}(\bm{u}^{\intercal}\bm{Z})^{2k}\leq 2A_{0}^{2k}\,k!$ for all $k\geq 1$ . This implies

[TABLE]

It then follows from Bernstein’s inequality that for any $x\geq 0$ ,

[TABLE]

Taking the union bound over all $\bm{u}\in\mathcal{N}_{1/4}$ yields

[TABLE]

with probability at least $1-9^{d}\cdot 2e^{-x}$ . Replacing $x$ with $x+d\log 9$ , so that the probability becomes at least $1-2e^{-x}$ , we reach (A.8). ∎

The next lemma gives a deviation bound for the $\ell_{2}$ -norm of the $d$ -variate random vector $\bm{\xi}^{\mathrm{\tiny{b}}}=-\sum_{i=1}^{n}\xi_{i}U_{i}\bm{Z}_{i}$ , where $\xi_{i}$ are given in (A.2). Recall that $\mathbb{P}^{*}$ is the conditional probability measure over the random multipliers given $\mathcal{D}_{n}=\{(Y_{i},\bm{X}_{i})\}_{i=1}^{n}$ .

Lemma A.3.

Assume Condition 2 is fulfilled. For every $x>0$ , it holds with $\mathbb{P}^{*}$ -probability at least $1-2e^{-x}$ that

[TABLE]

where $B_{U}>0$ is a constant depending only on $A_{U}$ .

Proof.

The proof is based on an argument similar to the one leading to (B.8). There exists a $1/2$ -net $\mathcal{N}_{1/2}$ of $\mathbb{S}^{d-1}$ with cardinality $|\mathcal{N}_{1/2}|\leq 5^{d}$ such that

[TABLE]

For each fixed $\bm{u}\in\mathbb{S}^{d-1}$ , applying Theorem 2.6.3 in Vershynin (2018) gives

[TABLE]

where $C=C(A_{U})>0$ and $\xi_{i}=\psi_{\tau}(\varepsilon_{i})$ are as in (A.2). Taking the union bound over all vectors $\bm{u}\in\mathcal{N}_{1/2}$ , we obtain that with $\mathbb{P}^{*}$ -probability greater than $1-5^{d}\cdot 2e^{-x}$ ,

[TABLE]

Replacing $x$ with $x+d\log 5$ in this inequality gives the stated result (A.9). ∎

Recall the random process $\bm{\xi}^{\mathrm{\tiny{b}}}(\bm{\theta})=-\sum_{i=1}^{n}\ell^{\prime}_{\tau}(Y_{i}-\bm{X}_{i}^{\intercal}\bm{\theta})U_{i}\bm{Z}_{i}$ , $\bm{\theta}\in\mathbb{R}^{d}$ defined in (2.20). The following lemma gives an upper bound on the local fluctuation $\sup_{\bm{\theta}\in\Theta_{0}(r)}\|\bm{\xi}^{\mathrm{\tiny{b}}}(\bm{\theta})-\bm{\xi}^{\mathrm{\tiny{b}}}(\bm{\theta}^{*})\|_{2}$ for $r>0$ .

Lemma A.4.

Assume Condition 2 holds. For any $x>0$ , it holds with $\mathbb{P}^{*}$ -probability at least $1-e^{-x}$ that

[TABLE]

where $C>0$ depends only on $A_{U}$ and $M_{n,4}$ is given in (A.4).

Proof.

To begin with, note that

[TABLE]

Define a new process $\bm{\xi}^{\mathrm{\tiny{b}}}(\bm{\theta},\bm{\omega})=\bm{\omega}^{\intercal}\{\bm{\xi}^{\mathrm{\tiny{b}}}({\bm{\theta}})-\bm{\xi}^{\mathrm{\tiny{b}}}({\bm{\theta}}^{*})\}/(2r\sqrt{n})$ for $\bm{\theta}\in\Theta_{0}(r)$ and $\bm{\omega}\in\mathbb{B}^{d}(r)$ , so that

[TABLE]

It is easy to see that $\mathbb{E}^{*}\{\bm{\xi}^{\mathrm{\tiny{b}}}(\bm{\theta},\bm{\omega})\}=0$ and

[TABLE]

For any $\bm{u},\bm{v}\in\mathbb{R}^{d}$ and $\lambda\in\mathbb{R}$ , by Hölder’s inequality,

[TABLE]

Write $\overline{\bm{u}}=\bm{\Sigma}^{1/2}\bm{u}/\|((\bm{\Sigma}^{1/2}\bm{u})^{\intercal},\bm{v})\|_{2}$ and $\overline{\bm{v}}=\bm{v}/\|((\bm{\Sigma}^{1/2}\bm{u})^{\intercal},\bm{v})\|_{2}$ . For the first term on the right-hand side of (A.12), it follows from (2.20) and Condition 2 that

[TABLE]

almost surely. For the second term, by the mean value theorem and taking $\bm{\delta}_{r}=\bm{\Sigma}^{1/2}(\bm{\theta}-\bm{\theta}^{*})/r$ , we get

[TABLE]

almost surely, where $\widetilde{\bm{\theta}}$ is a convex combination of $\bm{\theta}$ and $\bm{\theta}^{*}$ and $M_{n,4}$ is given in (A.4). Putting (A.12), (A.13) and (A.14) together yields

[TABLE]

Applying a conditional version of Theorem A.1 in Spokoiny (2013) with $p=2d$ ,

[TABLE]

to the process $\{\bm{\xi}^{\mathrm{\tiny{b}}}(\bm{\theta},\bm{\omega}):\bm{\theta}\in\Theta_{0}(r),\bm{\omega}\in\mathbb{B}^{d}(r)\}$ in (A.11), we arrive at

[TABLE]

almost surely. This is the bound stated in (A.10). ∎

The following lemma provides moderate deviation results for the robust estimators $\widehat{\mu}_{k}$ given in (4.1).

Lemma A.5.

Assume Condition 3 holds. Let $\{a_{n}\}_{n\geq 1}$ be a sequence of positive numbers satisfying $a_{n}\to\infty$ and $a_{n}=o(n^{1/2})$ as $n\to\infty$ . For each $1\leq k\leq m$ , the robust estimator $\widehat{\mu}_{k}$ with $\tau_{k}=v_{k}\{n/(s+a_{n})\}^{1/3}$ for some $v_{k}\geq\upsilon_{k,4}^{1/4}$ satisfies

[TABLE]

uniformly for $0\leq z\leq o\{\sigma_{k}\min(n^{1/6},\sqrt{n}a_{n}^{-1})\}$ and $z\leq\sigma_{k}\sqrt{a_{n}}$ .

Proof.

Let $1\leq k\leq m$ be fixed and write $\tau=\tau_{k}$ for simplicity. Define truncated mean and variance of $\varepsilon_{k}$ by $m_{k,\tau}=\mathbb{E}\{\psi_{\tau}(\varepsilon_{k})\}$ and $\sigma_{k,\tau}^{2}=\mathbb{E}\{\psi^{2}_{\tau}(\varepsilon_{k})\}$ . Moreover, define

[TABLE]

Taking $\bm{X}=(1,\bm{x}^{\intercal})^{\intercal}$ , $\bm{\theta}^{*}=(\mu_{k},\bm{\beta}_{k}^{\intercal})^{\intercal}$ and $\varepsilon=\varepsilon_{k}$ in Theorem 2.1, we obtain that with probability at least $1-4e^{-a_{n}}$ ,

[TABLE]

as long as $n\geq C_{2}(s+a_{n})$ . We prove (A.15) by considering the following two cases.

Case 1: Assume $0\leq z/\sigma_{k}\leq 1$ . Applying Theorem 2.2 in Zhou et al. (2018) to $T_{k}$ gives

[TABLE]

where $C>0$ is an absolute constant. The stated result (A.15) then holds uniformly for $0\leq z\leq\sigma_{k}$ .

Case 2: Assume $1\leq z/\sigma_{k}\leq\sqrt{a_{n}}$ . It follows from Proposition A.2 with $\kappa=4$ in the supplement of Zhou et al. (2018) that $|m_{k,\tau}|\leq\upsilon_{k,4}\tau^{-3}$ . Together with (A.16), this implies that with probability at least $1-4e^{-a_{n}}$ ,

[TABLE]

It follows that

[TABLE]

Next we focus on $T_{0k}$ . Recall that $\tau=v_{k}\{n/(s+a_{n})\}^{1/3}$ with $v_{k}\geq\upsilon_{k,4}^{1/4}$ . To apply Lemma 3.1 in the supplement of Liu and Shao (2014), we take

[TABLE]

and note that

[TABLE]

where $C_{2}>0$ is an absolute constant. Consequently, taking $d_{n}=n^{-1/6}$ , $x=\sqrt{n}z$ and $t_{n}=(C_{3,1}^{-1/2}\vee 4)\{z/\sigma_{k}+(\log n)^{1/2}\}$ in Lemma 3.1 implies that for all sufficiently large $n$ ,

[TABLE]

uniformly over $0\leq z/\sigma_{k}\leq c_{2}\min(c_{n}^{-1},\beta_{n}^{-1/3},d_{n}^{-1})$ , where $Z\sim\mathcal{N}(0,1)$ , $c_{1}>0$ depends only on $(\sigma_{k},\upsilon_{k,3})$ and $C_{3},c_{2}>0$ are absolute constants. For the standard normal distribution, it is known that for any $w>0$ ,

[TABLE]

Combining this with (A.19), we obtain that for $z>\sigma_{k}$ and $\delta_{1}$ in (A.17),

[TABLE]

and

[TABLE]

Finally, observe that $e^{-a_{n}}\leq e^{-a_{n}/2-(z/\sigma_{k})^{2}/2}$ for $z/\sigma_{k}\leq\sqrt{a_{n}}$ . Then it follows from (A.17)–(A.21) that (A.15) holds uniformly for $1\leq z/\sigma_{k}\leq o\{\min(n^{1/6},\sqrt{n}a_{n}^{-1})\}$ , which completes the proof. ∎
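To make the object of Lemma A.5 concrete, the following sketch computes a Huber-type location estimator by solving the first-order condition $\sum_{i}\psi_{\tau}(y_{i}-\mu)=0$ . It is a simplified illustration under stated assumptions: it treats the location-only case without covariates (the estimators $\widehat{\mu}_{k}$ in (4.1) are intercepts of a regression fit), it uses bisection rather than any particular algorithm from the paper, and the values of `v`, `s` and `a_n` passed to the hypothetical helper `lemma_tau` are arbitrary.

```python
import math


def psi(u, tau):
    """Huber score psi_tau(u) = sign(u) * min(|u|, tau)."""
    return max(-tau, min(tau, u))


def lemma_tau(v, n, s, a_n):
    """Robustification parameter tau = v * {n / (s + a_n)}^(1/3), as in Lemma A.5."""
    return v * (n / (s + a_n)) ** (1.0 / 3.0)


def huber_location(y, tau, tol=1e-10):
    """Solve sum_i psi_tau(y_i - mu) = 0 for mu by bisection.

    The score is nonincreasing in mu, nonnegative at min(y) and
    nonpositive at max(y), so a root lies in [min(y), max(y)].
    """
    lo, hi = min(y), max(y)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sum(psi(yi - mid, tau) for yi in y) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


y = [0.1, -0.4, 0.3, 0.2, -0.2, 0.0, 0.5, -0.3, 0.1, 50.0]  # one gross outlier
tau = lemma_tau(v=1.0, n=len(y), s=1, a_n=1.0)  # hypothetical v, s, a_n
mu_huber = huber_location(y, tau)
mu_mean = sum(y) / len(y)
```

In this example the outlier contributes at most $\tau$ to the score, so the Huber estimate stays close to the bulk of the data while the sample mean is pulled above $5$ .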

Appendix B Proofs for Section 2

Without loss of generality, we assume $t\geq\log 2$ , or equivalently $2e^{-t}\leq 1$ throughout the proof; otherwise if $2e^{-t}>1$ , the conclusion is trivial. Let $\|\cdot\|_{\bm{\Sigma},2}$ denote the rescaled $\ell_{2}$ -norm on $\mathbb{R}^{d}$ , i.e. $\|\bm{u}\|_{\bm{\Sigma},2}=\|\bm{\Sigma}^{1/2}\bm{u}\|_{2}$ for $\bm{u}\in\mathbb{R}^{d}$ .

B.1 Proof of Theorem 2.1

Proof of (2.6). To begin with, define the parameter set

[TABLE]

For any prespecified $r>0$ , we can find an intermediate estimator $\widehat{\bm{\theta}}_{\tau,\eta}=\bm{\theta}^{*}+\eta(\widehat{\bm{\theta}}_{\tau}-\bm{\theta}^{*})$ for some $\eta\in[0,1]$ , satisfying $w(\eta):=\|\widehat{\bm{\theta}}_{\tau,\eta}-\bm{\theta}^{*}\|_{\bm{\Sigma},2}\leq r$ . In fact, if $\widehat{\bm{\theta}}_{\tau}\in\Theta_{0}(r)$ , we can simply take $\eta=1$ ; otherwise, since the function $w(\cdot):[0,1]\to[0,\infty)$ is continuous with $w(0)=0$ and $w(1)>r$ , there always exists some $\eta\in(0,1)$ such that $w(\eta)=r$ . Applying Lemma F.2 in Fan et al. (2018) to the loss function $\overline{\mathcal{L}}_{\tau}:=(1/n)\mathcal{L}_{\tau}$ , we obtain

[TABLE]

where the last step uses the first order condition $\nabla\mathcal{L}_{\tau}(\widehat{\bm{\theta}}_{\tau})=\textbf{0}$ .
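The construction of the intermediate estimator $\widehat{\bm{\theta}}_{\tau,\eta}$ is explicit, because $w(\eta)=\eta\|\widehat{\bm{\theta}}_{\tau}-\bm{\theta}^{*}\|_{\bm{\Sigma},2}$ is linear in $\eta$ . A minimal sketch follows; the function names and the numerical values of $\bm{\Sigma}$ , $\bm{\theta}^{*}$ and $\widehat{\bm{\theta}}_{\tau}$ are hypothetical and serve only to illustrate the shrinkage step.

```python
import math


def sigma_norm(u, Sigma):
    """Rescaled norm ||u||_{Sigma,2} = ||Sigma^{1/2} u||_2 = sqrt(u^T Sigma u)."""
    d = len(u)
    quad = sum(u[i] * Sigma[i][j] * u[j] for i in range(d) for j in range(d))
    return math.sqrt(quad)


def intermediate_estimator(theta_hat, theta_star, Sigma, r):
    """Return (eta, theta_eta) with theta_eta = theta* + eta * (theta_hat - theta*).

    eta = 1 if theta_hat already lies in Theta_0(r); otherwise
    eta = r / ||theta_hat - theta*||_{Sigma,2}, which places theta_eta
    on the boundary of Theta_0(r).
    """
    diff = [a - b for a, b in zip(theta_hat, theta_star)]
    dist = sigma_norm(diff, Sigma)
    eta = 1.0 if dist <= r else r / dist
    theta_eta = [b + eta * dlt for b, dlt in zip(theta_star, diff)]
    return eta, theta_eta


Sigma = [[2.0, 0.5], [0.5, 1.0]]  # illustrative covariance matrix
theta_star = [1.0, -1.0]
theta_hat = [3.0, 0.0]            # far from theta_star in the Sigma-norm
eta, theta_eta = intermediate_estimator(theta_hat, theta_star, Sigma, r=0.5)
```

Here $\|\widehat{\bm{\theta}}_{\tau}-\bm{\theta}^{*}\|_{\bm{\Sigma},2}=\sqrt{11}>r$ , so the sketch returns some $\eta\in(0,1)$ and a point lying exactly on the boundary of $\Theta_{0}(r)$ .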

In what follows, we bound the two sides of (B.2) separately. Proposition B.1 below shows that $\overline{\mathcal{L}}_{\tau}$ is strongly convex on $\Theta_{0}(r)$ with high probability.

Proposition B.1.

Assume that the kurtoses of the linear forms $\langle\bm{u},\bm{Z}\rangle$ are uniformly bounded by $\kappa^{4}$ for some $\kappa>0$ , i.e. $\mathbb{E}\langle\bm{u},\bm{Z}\rangle^{4}\leq\kappa^{4}\|\bm{u}\|_{2}^{4}$ for all $\bm{u}\in\mathbb{R}^{d}$ . Let $(\tau,r)$ and $(n,d,t)$ satisfy

[TABLE]

where $C>0$ is an absolute constant. Then with probability at least $1-e^{-t}$ ,

[TABLE]

By construction, $\widehat{\bm{\theta}}_{\tau,\eta}\in\Theta_{0}(r)$ and therefore under the scaling (B.3),

[TABLE]

with probability at least $1-e^{-t}$ .

Next we bound the norm $\|\bm{\Sigma}^{-1/2}\nabla\overline{\mathcal{L}}_{\tau}(\bm{\theta}^{*})\|_{2}$ . Define the centered random vector $\bm{\gamma}=\bm{\Sigma}^{-1/2}\{\nabla\overline{\mathcal{L}}_{\tau}(\bm{\theta}^{*})-\mathbb{E}\nabla\overline{\mathcal{L}}_{\tau}(\bm{\theta}^{*})\}$ so that

[TABLE]

To bound $\|\bm{\gamma}\|_{2}$ , by a standard covering argument, there exists a $1/2$ -net $\mathcal{N}_{1/2}$ of $\mathbb{S}^{d-1}$ with $|\mathcal{N}_{1/2}|\leq 5^{d}$ such that $\|\bm{\gamma}\|_{2}\leq 2\max_{\bm{u}\in\mathcal{N}_{1/2}}|\bm{u}^{\intercal}\bm{\gamma}|$ . For every $\bm{u}\in\mathbb{S}^{d-1}$ , note that $|\bm{u}^{\intercal}\bm{\gamma}|=|(1/n)\sum_{i=1}^{n}\{\xi_{i}\bm{u}^{\intercal}\bm{Z}_{i}-\mathbb{E}\xi_{i}\bm{u}^{\intercal}\bm{Z}_{i}\}|$ , where $\xi_{i}=\ell^{\prime}_{\tau}(\varepsilon_{i})$ and $\bm{Z}_{i}$ are IID from $\bm{Z}$ given in (2.5). Since $\bm{u}^{\intercal}\bm{Z}$ is sub-Gaussian, it follows from the proof of Proposition 2.5.2 in Vershynin (2018) that

[TABLE]

If $k=2\ell$ for some $\ell\geq 1$ , $\mathbb{E}|\bm{u}^{\intercal}\bm{Z}|^{k}\leq 2A_{0}^{k}(k/2)!$ ; otherwise if $k=2\ell+1$ for some $\ell\geq 1$ ,

[TABLE]

By the above calculations, we obtain

[TABLE]

Applying Bernstein’s inequality, we see that

[TABLE]

Taking the union bound over all vectors $\bm{u}\in\mathcal{N}_{1/2}$ , we obtain that with probability at least $1-5^{d}\cdot 2e^{-x}$ , $\|\bm{\gamma}\|_{2}\leq 2\max_{\bm{u}\in\mathcal{N}_{1/2}}|\bm{u}^{\intercal}\bm{\gamma}|\leq 4\sigma A_{0}\sqrt{x/n}+A_{0}\tau x/n$ . Taking $x=2(d+t)\geq\log(5^{d})+2t$ , we reach

[TABLE]
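The probability bookkeeping behind the choice $x=2(d+t)$ can be spelled out as follows. This is a short verification, using only the standing assumption $t\geq\log 2$ (equivalently $2e^{-t}\leq 1$ ) made at the beginning of this appendix and the elementary fact $5e^{-2}<1$ .

```latex
5^{d}\cdot 2e^{-x}\,\Big|_{x=2(d+t)}
  \;=\; 2\,(5e^{-2})^{d}\,e^{-2t}
  \;\leq\; 2e^{-2t}
  \;=\; (2e^{-t})\,e^{-t}
  \;\leq\; e^{-t}.
```

Hence the union bound over the net fails with probability at most $e^{-t}$ .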

For the second term $\|\bm{\Sigma}^{-1/2}\mathbb{E}\nabla\overline{\mathcal{L}}_{\tau}(\bm{\theta}^{*})\|_{2}$ in (B.6), it holds

[TABLE]

Together, the last two displays imply with probability at least $1-e^{-t}$ ,

[TABLE]

Finally, in view of (B.3), (B.5) and (B.9), we choose $r=\tau/(4\kappa^{2})$ . Under Condition 1, $\kappa$ scales as $A_{0}$ . Then with probability at least $1-2e^{-t}$ , $\widehat{\bm{\theta}}_{\tau,\eta}\in\Theta_{0}(4r_{0})$ under the scaling (B.3). Provided $n\gtrsim A_{0}^{4}(d+t)$ , we have $r>4r_{0}$ so that $\widehat{\bm{\theta}}_{\tau,\eta}$ lies in the interior of $\Theta_{0}(r)$ , which enforces $\eta=1$ and $\widehat{\bm{\theta}}_{\tau,\eta}=\widehat{\bm{\theta}}_{\tau}$ (otherwise $\widehat{\bm{\theta}}_{\tau,\eta}$ will lie on the boundary). Putting together the pieces, we arrive at (2.6).

Proof of (2.7). Next we prove (2.7). Define random processes

[TABLE]

and $\bm{\zeta}(\bm{\theta})=\overline{\mathcal{L}}_{\tau}(\bm{\theta})-\mathbb{E}\overline{\mathcal{L}}_{\tau}(\bm{\theta})$ . In this notation, we have

[TABLE]

In the following, we will deal with $\bm{B}(\bm{\theta})-\mathbb{E}\{\bm{B}(\bm{\theta})\}$ and $\mathbb{E}\{\bm{B}(\bm{\theta})\}$ separately, starting with the latter. By the mean value theorem for vector-valued functions (see, e.g. Theorem 12 in Section 2 of Pugh (2015)),

[TABLE]

where $\bm{\theta}^{*}_{t}=(1-t)\bm{\theta}^{*}+t\bm{\theta}$ . Note that

[TABLE]

For every $t\in[0,1]$ , since $\bm{\theta}\in\Theta_{0}(r)$ and $\bm{u}\in\mathbb{S}^{d-1}$ , we have $\bm{\theta}^{*}_{t}\in\Theta_{0}(r)$ so that $\bm{\delta}_{t}:=\bm{\Sigma}^{1/2}(\bm{\theta}_{t}^{*}-\bm{\theta}^{*})$ satisfies $\|\bm{\delta}_{t}\|_{2}\leq r$ . Consequently, by Markov’s inequality and (B.7),

[TABLE]

where $\kappa>0$ is as in Proposition B.1. Putting together the pieces implies

[TABLE]

Turning to $\bm{B}(\bm{\theta})-\mathbb{E}\{\bm{B}(\bm{\theta})\}=\bm{\Sigma}^{-1/2}\{\nabla\bm{\zeta}(\bm{\theta})-\nabla\bm{\zeta}(\bm{\theta}^{*})\}$ , we set

[TABLE]

It is easy to see that $\overline{\bm{B}}(\textbf{0})=\textbf{0}$ , $\mathbb{E}\{\overline{\bm{B}}(\bm{\delta})\}=\textbf{0}$ and

[TABLE]

In addition, for any $\bm{u},\bm{v}\in\mathbb{S}^{d-1}$ and $\lambda\in\mathbb{R}$ , using the inequality $|e^{z}-1-z|\leq z^{2}e^{|z|}/2$ for all $z\in\mathbb{R}$ gives

[TABLE]

By the Cauchy-Schwarz inequality,

[TABLE]

and

[TABLE]

Combining the last three displays, we arrive at

[TABLE]

Under Condition 1, there exists a constant $A_{1}=A_{1}(A_{0})>0$ such that, for all $|\lambda|\leq\sqrt{n}/A_{1}$ and $\bm{\theta}\in\mathbb{R}^{d}$ ,

[TABLE]

where $C>0$ is an absolute constant. With the above preparations, applying Theorem A.3 in Spokoiny (2013), which is a direct consequence of Corollary 2.2 in the supplement to Spokoiny (2012), yields

[TABLE]

as long as $n\geq A_{1}^{2}(8d+2t)$ . Combining this and (B.11), we reach

[TABLE]

with probability at least $1-e^{-t}$ , where $\bm{B}(\bm{\theta})$ is given in (B.10). Recalling the paragraph below (B.9), we have $\mathbb{P}\{\widehat{\bm{\theta}}_{\tau}\in\Theta_{0}(4r_{0})\}\geq 1-2e^{-t}$ and $\nabla\overline{\mathcal{L}}_{\tau}(\widehat{\bm{\theta}}_{\tau})=\textbf{0}$ . On the event $\{\widehat{\bm{\theta}}_{\tau}\in\Theta_{0}(4r_{0})\}$ , it holds $\|\bm{B}(\widehat{\bm{\theta}}_{\tau})\|_{2}\leq\Delta(4r_{0})$ . Consequently, taking $r=4r_{0}$ in (B.13) proves (2.7). ∎

B.2 Proof of Theorem 2.2

Keeping the notation used in the proof of Theorem 2.1, we consider the following local quadratic approximation of the Huber loss. For any $r>0$ and $\bm{\theta},\bm{\theta}^{\prime}\in\Theta_{0}(r)$ , define

[TABLE]

Taking the gradient with respect to $\bm{\theta}$ , we get $\nabla_{\bm{\theta}}R(\bm{\theta},\bm{\theta}^{\prime})=\nabla\mathcal{L}_{\tau}(\bm{\theta})-\nabla\mathcal{L}_{\tau}(\bm{\theta}^{\prime})-n\bm{\Sigma}(\bm{\theta}-\bm{\theta}^{\prime})$ . Then, by the mean value theorem, $R(\bm{\theta},\bm{\theta}^{\prime})=(\bm{\theta}-\bm{\theta}^{\prime})^{\intercal}\{\nabla\mathcal{L}_{\tau}(\widetilde{\bm{\theta}})-\nabla\mathcal{L}_{\tau}(\bm{\theta}^{\prime})-n\bm{\Sigma}(\widetilde{\bm{\theta}}-\bm{\theta}^{\prime})\}$ , where $\widetilde{\bm{\theta}}$ is a convex combination of $\bm{\theta}$ and $\bm{\theta}^{\prime}$ and hence $\widetilde{\bm{\theta}}\in\Theta_{0}(r)$ . It follows that

[TABLE]

where $\Delta(r)$ is as in (B.13). Recall from the proof of Theorem 2.1 that $\mathbb{P}\{\widehat{\bm{\theta}}_{\tau}\in\Theta_{0}(4r_{0})\}\geq 1-2e^{-t}$ for $r_{0}$ given in (B.9). Taking $r=4r_{0}$ in (B.13), $(\bm{\theta},\bm{\theta}^{\prime})=(\bm{\theta}^{*},\widehat{\bm{\theta}}_{\tau})$ in (B.14) and using the fact $\nabla\mathcal{L}_{\tau}(\widehat{\bm{\theta}}_{\tau})=\textbf{0}$ , we obtain that with probability greater than $1-3e^{-t}$ ,

[TABLE]

Write $\widehat{\bm{\delta}}=\bm{\Sigma}^{1/2}(\widehat{\bm{\theta}}_{\tau}-\bm{\theta}^{*})$ and $\bm{\Gamma}^{*}=\bm{\Sigma}^{-1/2}\nabla\overline{\mathcal{L}}_{\tau}(\bm{\theta}^{*})$ . By (B.9) and (B.13), we have $\|\bm{\Gamma}^{*}\|_{2}\leq r_{0}$ , $\|\widehat{\bm{\delta}}+\bm{\Gamma}^{*}\|_{2}\leq\Delta(4r_{0})$ and

[TABLE]

Together, the last two displays imply that with probability at least $1-3e^{-t}$ ,

[TABLE]

which, together with (B.13), proves (2.8).

For the square-root Wilks’ expansion, on $\{\widehat{\bm{\theta}}_{\tau}\in\Theta_{0}(4r_{0})\}$ it holds

[TABLE]

where the last step follows from (B.14). Moreover, note that

[TABLE]

Combining the last two displays with (B.13) proves (2.9). ∎

B.3 Proof of Proposition B.1

Since the Huber loss is convex and differentiable, we have

[TABLE]

where $I_{\mathcal{E}_{i}}$ is the indicator function of the event

[TABLE]

on which $|Y_{i}-\bm{X}_{i}^{\intercal}\bm{\theta}|\leq\tau$ for all $\bm{\theta}\in\Theta_{0}(r)$ . Also, recall that $\ell_{\tau}^{\prime\prime}(u)=1$ for $|u|\leq\tau$ . For any $R>0$ , define functions

[TABLE]

In particular, $\varphi_{R}$ is $R$ -Lipschitz and satisfies

[TABLE]

It then follows that

[TABLE]

To bound the right-hand side of (B.17), consider the supremum of a random process indexed by $\Theta_{0}(r)$ :

[TABLE]

For any $\bm{\theta}$ fixed, write $\bm{\delta}=\bm{\theta}-\bm{\theta}^{*}$ . By (B.16),

[TABLE]

Provided $\tau\geq 2\max\{(4\upsilon_{2+\delta})^{1/(2+\delta)},4\kappa^{2}r\}$ , it follows that

[TABLE]

for all $\bm{\theta}\in\Theta_{0}(r)$ . From (B.17)–(B.19), we conclude that

[TABLE]

Next we deal with the stochastic term $\Delta_{r}$ defined in (B.18). For $g(\bm{\theta})$ given in (B.17), we write $g(\bm{\theta})=(1/n)\sum_{i=1}^{n}g_{i}(\bm{\theta})$ . Recalling that $0\leq\varphi_{R}(u)\leq R^{2}/4$ and $0\leq\psi_{R}(u)\leq 1$ for all $u\in\mathbb{R}$ , it is easy to see that $0\leq g_{i}(\bm{\theta})\leq(\tau/4r)^{2}\|\bm{\theta}-\bm{\theta}^{*}\|_{\bm{\Sigma},2}^{2}$ . By Talagrand’s inequality (see, e.g. Theorem 7.3 in Bousquet (2003)), we have for any $x>0$ ,

[TABLE]

where $\sigma_{n}^{2}=\sup_{\bm{\theta}\in\Theta_{0}(r)}\mathbb{E}g_{i}^{2}(\bm{\theta})/\|\bm{\theta}-\bm{\theta}^{*}\|_{\bm{\Sigma},2}^{4}$ . By (B.16), $\mathbb{E}g_{i}^{2}(\bm{\theta})\leq\mathbb{E}\langle\bm{X}_{i},\bm{\theta}-\bm{\theta}^{*}\rangle^{4}\leq\kappa^{4}\|\bm{\theta}-\bm{\theta}^{*}\|_{\bm{\Sigma},2}^{4}$ , implying $\sigma_{n}\leq\kappa^{2}$ .

To bound the expectation $\mathbb{E}\Delta_{r}$ , applying the symmetrization inequality for empirical processes, and by the connection between Gaussian and Rademacher complexities, we have $\mathbb{E}\Delta_{r}\leq 2\sqrt{\pi/2}\,\mathbb{E}\{\sup_{\bm{\theta}\in\Theta_{0}(r)}|\mathbb{G}_{\bm{\theta}}|\}$ , where

[TABLE]

and $g_{i}$ are IID standard normal random variables that are independent of the observed data. For any $\bm{\theta}_{0}\in\Theta_{0}(r)$ , it holds

[TABLE]

where $\mathbb{E}^{*}$ denotes the conditional expectation given $\{(Y_{i},\bm{X}_{i})\}_{i=1}^{n}$ . Taking the expectation with respect to $\{(Y_{i},\bm{X}_{i})\}_{i=1}^{n}$ on both sides, we see that (B.22) remains valid with $\mathbb{E}^{*}$ replaced by $\mathbb{E}$ . To select a proper $\bm{\theta}_{0}$ , first decompose $\bm{\theta}^{*}$ as $(\theta_{0},\widetilde{\bm{\theta}}^{*\intercal})^{\intercal}$ , where $\theta_{0}$ denotes the first coordinate of $\bm{\theta}^{*}$ and $\widetilde{\bm{\theta}}^{*}\in\mathbb{R}^{d-1}$ consists of the remaining coordinates. Taking $\bm{\theta}_{0}=(\theta_{0}+\sigma_{11}^{-1/2}r,\widetilde{\bm{\theta}}^{*\intercal})^{\intercal}$ , we observe that $\|\bm{\theta}_{0}-\bm{\theta}^{*}\|_{\bm{\Sigma},2}=r$ . Since $\varphi_{R}(u)\leq\min(u^{2},R^{2}/4)$ , it holds

[TABLE]

As in the proof of Lemma 11 in Loh and Wainwright (2015), we next use the Gaussian comparison theorem to bound the expectation of the (conditional) Gaussian supremum $\mathbb{E}^{*}\{\sup_{\Theta_{0}(r)}\mathbb{G}_{\bm{\theta}}\}$ .

Let $\textnormal{var}^{*}$ be the conditional variance given $\{(Y_{i},\bm{X}_{i})\}_{i=1}^{n}$ . For $\bm{\theta},\bm{\theta}^{\prime}\in\Theta_{0}(r)$ , write $\bm{\delta}=\bm{\theta}-\bm{\theta}^{*}$ and $\bm{\delta}^{\prime}=\bm{\theta}^{\prime}-\bm{\theta}^{*}$ . Then

[TABLE]

Note that $\varphi_{cR}(cu)=c^{2}\varphi_{R}(u)$ for any $c>0$ . In particular, taking $R=\tau\|\bm{\delta}^{\prime}\|_{\bm{\Sigma},2}/(2r)$ and $c=\|\bm{\delta}\|_{\bm{\Sigma},2}/\|\bm{\delta}^{\prime}\|_{\bm{\Sigma},2}$ delivers

[TABLE]
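The scaling identity $\varphi_{cR}(cu)=c^{2}\varphi_{R}(u)$ and the properties of $\varphi_{R}$ quoted in this proof can be checked numerically. The sketch below assumes that $\varphi_{R}$ is the truncated quadratic used in Loh and Wainwright (2015), equal to $u^{2}$ for $|u|\leq R/2$ , to $(R-|u|)^{2}$ for $R/2<|u|\leq R$ , and to $0$ otherwise. This specific form is an assumption made for the illustration and should be compared with the definition given before (B.16).

```python
def phi(u, R):
    """Truncated quadratic (ASSUMED form, following Loh and Wainwright, 2015)."""
    a = abs(u)
    if a <= R / 2:
        return u * u
    if a <= R:
        return (R - a) ** 2
    return 0.0


R, c = 1.5, 0.4  # arbitrary illustrative values
grid = [k / 100.0 for k in range(-300, 301)]

# Scaling identity phi_{cR}(cu) = c^2 * phi_R(u).
scaling_gap = max(abs(phi(c * u, c * R) - c * c * phi(u, R)) for u in grid)

# Envelope phi_R(u) <= min(u^2, R^2 / 4).
envelope_ok = all(phi(u, R) <= min(u * u, R * R / 4) + 1e-12 for u in grid)

# R-Lipschitz property, checked on neighbouring grid points.
lipschitz_ok = all(
    abs(phi(v, R) - phi(u, R)) <= R * abs(v - u) + 1e-12
    for u, v in zip(grid, grid[1:])
)
```

Under this assumed form, all three checks hold on the grid, matching the properties $\varphi_{R}(u)\leq\min(u^{2},R^{2}/4)$ and $R$ -Lipschitz continuity used in the argument.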

Putting the above calculations together, we obtain

[TABLE]

Next, define another (conditional) Gaussian process indexed by $\bm{\theta}$ :

[TABLE]

where $g^{\prime}_{i}$ are IID standard normal random variables that are independent of all other random variables. By (B.23), $\textnormal{var}^{*}(\mathbb{G}_{\bm{\theta}}-\mathbb{G}_{\bm{\theta}^{\prime}})\leq\textnormal{var}^{*}(\mathbb{Z}_{\bm{\theta}}-\mathbb{Z}_{\bm{\theta}^{\prime}})$ . By the Gaussian comparison inequality (Ledoux and Talagrand, 1991),

[TABLE]

Together with the unconditional version of (B.22), this implies

[TABLE]

Combining this with (B.21) and (B.20), we obtain that with probability at least $1-e^{-t}$ ,

[TABLE]

for all sufficiently large $n$ that scales as $(\tau/r)^{2}(d+t)$ up to an absolute constant. This proves (B.4). ∎

B.4 Proof of Theorem 2.3

Throughout we assume $t\geq 1$ and keep the notation used in the proof of Theorem 2.1.

Proof of (2.18). Recall the weighted loss function $\mathcal{L}_{\tau}^{\mathrm{\tiny{b}}}(\bm{\theta})=\sum_{i=1}^{n}W_{i}\ell_{\tau}(Y_{i}-\bm{X}_{i}^{\intercal}\bm{\theta})$ , $\bm{\theta}\in\mathbb{R}^{d}$ and the parameter set $\Theta_{0}(r)$ defined in (B.1). Write $r_{1}=4r_{0}$ for $r_{0}$ as in (B.9), so that

[TABLE]

for some $C_{1}=C_{1}(A_{0})>0$ . By the definition of $\widehat{\bm{\theta}}^{\mathrm{\tiny{b}}}_{\tau}$ , $\mathcal{L}_{\tau}^{\mathrm{\tiny{b}}}(\widehat{\bm{\theta}}^{\mathrm{\tiny{b}}}_{\tau})-\mathcal{L}_{\tau}^{\mathrm{\tiny{b}}}(\widehat{\bm{\theta}}_{\tau})\leq 0$ and

[TABLE]

where $R_{1}:=\overline{\lambda}_{\bm{\Sigma}}^{1/2}R$ . If we can show that, for some $r_{2}\geq r_{1}$ to be specified,

[TABLE]

with high probability, then we must have $\widehat{\bm{\theta}}^{\mathrm{\tiny{b}}}_{\tau}\in\Theta_{0}(r_{2})$ with high probability. Here and below, we set $\partial\Theta_{0}(r):=\{\bm{\theta}\in\mathbb{R}^{d}:\|\bm{\theta}-\bm{\theta}^{*}\|_{\bm{\Sigma},2}=r\}$ .
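For intuition about the bootstrap estimator $\widehat{\bm{\theta}}^{\mathrm{\tiny{b}}}_{\tau}$ whose localization is being established, the following sketch minimizes the weighted Huber loss $\sum_{i}W_{i}\ell_{\tau}(Y_{i}-\mu)$ in the location-only case. Several simplifications are assumptions of the sketch rather than features of the paper: there are no covariates, the minimization is unconstrained (no ball of radius $R$ around $\widehat{\bm{\theta}}_{\tau}$ ), the value of $\tau$ is arbitrary, and the weights are $W_{i}=1+e_{i}$ with Rademacher $e_{i}$ so that the weighted loss stays convex (Table 1 of the paper instead uses $\mathcal{N}(1,1)$ weights).

```python
import random


def psi(u, tau):
    """Huber score psi_tau(u) = sign(u) * min(|u|, tau)."""
    return max(-tau, min(tau, u))


def weighted_huber_location(y, w, tau, tol=1e-10):
    """Minimise sum_i w_i * l_tau(y_i - mu) over mu, for weights w_i >= 0.

    With nonnegative weights the loss is convex, so it suffices to find a
    root of the nonincreasing score sum_i w_i * psi_tau(y_i - mu) by bisection.
    """
    lo, hi = min(y), max(y)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sum(wi * psi(yi - mid, tau) for wi, yi in zip(w, y)) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


rng = random.Random(1)
y = [rng.gauss(0.0, 1.0) for _ in range(200)]
tau = 2.0  # hypothetical robustification level
mu_hat = weighted_huber_location(y, [1.0] * len(y), tau)

boot = []
for _ in range(200):
    # ASSUMPTION: W_i = 1 + e_i with Rademacher e_i, so W_i is 0 or 2.
    w = [1.0 + rng.choice((-1.0, 1.0)) for _ in y]
    boot.append(weighted_huber_location(y, w, tau))

boot.sort()
spread = (boot[9], boot[189])  # rough 5% and 95% empirical quantiles of the draws
```

The draws in `boot` fluctuate around `mu_hat`; the theorem quantifies how far, in the $\bm{\Sigma}$ -norm, a bootstrap draw can move from $\bm{\theta}^{*}$ with high conditional probability.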

Centering the weighted Huber loss function, we define

[TABLE]

Note that

[TABLE]

In the following, we bound $\Pi_{1}(\bm{\theta})$ and $\Pi_{2}(\bm{\theta})$ separately, starting with the latter which only depends on the observed data. As before, define $\bm{\zeta}(\bm{\theta})=\mathcal{L}_{\tau}(\bm{\theta})-\mathbb{E}\{\mathcal{L}_{\tau}(\bm{\theta})\}$ and consider the decomposition

[TABLE]

First we deal with $\Pi_{21}(\bm{\theta})$ . For every $r>0$ , define the random process

[TABLE]

We will use Theorem A.1 in Spokoiny (2013) to bound the local fluctuation $|U_{r}(\bm{\theta})-U_{r}(\bm{\theta}^{*})|$ over $\bm{\theta}\in\Theta_{0}(r)$ . For any random variable $X$ , we write $(\mathbb{I}-\mathbb{E})X=X-\mathbb{E}(X)$ . For every $\bm{\theta}\in\Theta_{0}(r)$ , $\bm{v}\in\mathbb{R}^{d}$ and $\lambda\in\mathbb{R}$ , putting $\overline{\bm{v}}=\bm{\Sigma}^{1/2}\bm{v}/\|\bm{v}\|_{\bm{\Sigma},2}$ and $\bm{\delta}_{r}=\bm{\Sigma}^{1/2}(\bm{\theta}-\bm{\theta}^{*})/r$ , and by the mean value theorem, we have

[TABLE]

Similarly to (B.12), it can be shown that for all $|\lambda|\leq c_{1}\sqrt{n}$ and $\bm{\theta}\in\Theta_{0}(r)$ ,

[TABLE]

Using Theorem A.1 in Spokoiny (2013), we deduce that with $\mathbb{P}^{\dagger}$ -probability at least $1-e^{-t}$ ,

[TABLE]

as long as $n\geq c_{1}^{-2}(4d+2t)$ . In view of (B.26) and (B.27), it holds for every $r>0$ that

[TABLE]

with $\mathbb{P}^{\dagger}$ -probability at least $1-e^{-t}$ . The bound in (B.28) holds for any given $r>0$ . Following a slicing argument similar to that used in the proof of Theorem A.2 in Spokoiny (2013), it can be shown that with $\mathbb{P}^{\dagger}$ -probability at least $1-e^{-t}$ ,

[TABLE]

for all $r_{2}\leq r\leq r_{2}+R_{1}$ as long as $n\geq c_{1}^{-2}\{4d+2t+2\log(2+2R_{1}/r_{2})\}$ .

For $\Pi_{22}(\bm{\theta})$ , note that

[TABLE]

for $\bm{\gamma}$ as given in (B.6). This, together with (B.8), implies that with $\mathbb{P}^{\dagger}$ -probability at least $1-e^{-t}$ ,

[TABLE]

where $C_{3}=C_{3}(A_{0})>0$ .

Turning to $\Pi_{23}(\bm{\theta})$ , we define the function $h(\bm{\theta})=(1/n)\mathbb{E}\{\mathcal{L}_{\tau}(\bm{\theta})\}$ , $\bm{\theta}\in\mathbb{R}^{d}$ so that $\Pi_{23}(\bm{\theta})=n\{h(\bm{\theta})-h(\bm{\theta}^{*})\}$ . Put $\bm{\delta}=\bm{\Sigma}^{1/2}(\bm{\theta}-\bm{\theta}^{*})$ . By (2.5) and the mean value theorem, it follows that

[TABLE]

where $\widetilde{\bm{\theta}}$ is a point lying between $\bm{\theta}$ and $\bm{\theta}^{*}$ . Since $\mathbb{E}(\varepsilon|\bm{Z})=0$ , we have $-\mathbb{E}\{\psi_{\tau}(\varepsilon)|\bm{Z}\}=\mathbb{E}[\{\varepsilon-\tau\mathop{\mathrm{sign}}(\varepsilon)\}I(|\varepsilon|>\tau)|\bm{Z}]$ , which further implies

[TABLE]

Moreover,

[TABLE]

where $\kappa>0$ is as in Proposition B.1. Putting the above calculations together yields that for any $\bm{\theta}\in\partial\Theta_{0}(r)$ ,

[TABLE]

where $b(r):=1-\sigma^{2}\tau^{-2}-\kappa^{4}\tau^{-2}r^{2}-2\upsilon_{4}\tau^{-3}r^{-1}$ , $r>0$ .
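The behavior of the factor $b(r)$ is easy to explore numerically. The sketch below evaluates $b(r)$ exactly as defined above for a fixed $r$ and increasing $\tau$ ; the moment values $\sigma^{2}$ , $\kappa$ and $\upsilon_{4}$ are hypothetical and chosen only for illustration.

```python
def b(r, tau, sigma2, kappa, upsilon4):
    """b(r) = 1 - sigma^2/tau^2 - kappa^4 r^2/tau^2 - 2 upsilon_4/(tau^3 r)."""
    return (1.0
            - sigma2 / tau ** 2
            - kappa ** 4 * r ** 2 / tau ** 2
            - 2.0 * upsilon4 / (tau ** 3 * r))


# Hypothetical moment values, chosen only for illustration.
params = dict(sigma2=1.0, kappa=1.0, upsilon4=3.0)
values = [b(0.5, tau, **params) for tau in (2.0, 4.0, 8.0, 16.0)]
```

For these values $b(0.5)$ is negative at $\tau=2$ and increases toward $1$ as $\tau$ grows, since every subtracted term decays at least as fast as $\tau^{-2}$ .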

Combining (B.26), (B.29), (B.30) and (B.31), it follows that with $\mathbb{P}^{\dagger}$ -probability at least $1-2e^{-t}$ ,

[TABLE]

for all $\bm{\theta}\in\partial\Theta_{0}(r)$ with $r\in[r_{2},r_{2}+R_{1}]$ provided $n\geq c_{1}^{-2}\{4d+2t+2\log(2+2R_{1}/r_{2})\}$ .

Next we deal with the process $\Pi_{1}(\bm{\theta})=\bm{\zeta}^{\mathrm{\tiny{b}}}(\bm{\theta})-\bm{\zeta}^{\mathrm{\tiny{b}}}(\widehat{\bm{\theta}}_{\tau})$ in (B.25), where

[TABLE]

Decompose $\Pi_{1}(\bm{\theta})$ as

[TABLE]

We will use a conditional version of Theorem A.1 in Spokoiny (2013) to bound $\Pi_{11}(\bm{\theta})$ . Similarly to (B.27), define, for each $r>0$ ,

[TABLE]

Similarly to (B.1), define

[TABLE]

For every $\bm{\theta}\in\widehat{\Theta}_{0}(r)$ and $\bm{v}\in\mathbb{R}^{d}$ , by the mean value theorem we have

[TABLE]

where $\widetilde{\bm{\theta}}$ is a convex combination of $\bm{\theta}$ and $\widehat{\bm{\theta}}_{\tau}$ . Putting $\overline{\bm{\delta}}_{r}=\bm{\Sigma}^{1/2}(\bm{\theta}-\widehat{\bm{\theta}}_{\tau})/r$ and $\overline{\bm{v}}=\bm{\Sigma}^{1/2}\bm{v}/\|\bm{\Sigma}^{1/2}\bm{v}\|_{2}$ , we deduce that

[TABLE]

where

[TABLE]

Under Condition 2, it holds

[TABLE]

where $B_{U}=B_{U}(A_{U})>0$ . Plugging this into (B.36) shows

[TABLE]

With the above preparations in place, it follows from (B.34) and Theorem A.1 in Spokoiny (2013) that for any $r>0$ ,

[TABLE]

almost surely, where $M_{n,4}$ is given in (A.4). Again, using the slicing technique and applying the preceding bound to each slice separately, we obtain that with $\mathbb{P}^{*}$ -probability at least $1-e^{-t}$ ,

[TABLE]

for all $r_{1}\leq r\leq 2r_{2}+R_{1}$ . Note that, on the event $\{\widehat{\bm{\theta}}_{\tau}\in\Theta_{0}(r_{1})\}$ that occurs with $\mathbb{P}^{\dagger}$ -probability at least $1-2e^{-t}$ ,

[TABLE]

Combining the last two displays and taking $x=2t$ in Lemma A.1, we obtain that with $\mathbb{P}^{\dagger}$ -probability at least $1-3e^{-t}$ ,

[TABLE]

as long as $n\geq C_{0}(d+t)^{2}$ , where $C_{0}=C_{0}(A_{0})$ and $C_{4}=C_{4}(A_{0},A_{U})$ .

For $\Pi_{12}(\bm{\theta})$ in (B.33), it holds on the event $\{\widehat{\bm{\theta}}_{\tau}\in\Theta_{0}(r_{1})\}$ that, for every $\bm{\theta}\in\Theta_{0}(r)$ ,

[TABLE]

where $\bm{\xi}^{\mathrm{\tiny{b}}}(\bm{\theta})=\bm{\Sigma}^{-1/2}\nabla\bm{\zeta}^{\mathrm{\tiny{b}}}({\bm{\theta}})=-\sum_{i=1}^{n}\psi_{\tau}(Y_{i}-\bm{X}_{i}^{\intercal}\bm{\theta})U_{i}\bm{Z}_{i}$ is as in (2.20).

By Lemma A.4, it holds for each $r>0$ that

[TABLE]

almost surely. Combining this and Lemma A.1, we see that, conditioning on the same event where (B.37) holds,

[TABLE]

as long as $n\geq C_{0}(d+t)^{2}$ , where $C_{5}=C_{5}(A_{0},A_{U})>0$ .

For $\|\bm{\xi}^{\mathrm{\tiny{b}}}(\bm{\theta}^{*})\|_{2}$ , taking $x=2t$ in Lemmas A.2 and A.3, we see that with $\mathbb{P}^{\dagger}$ -probability at least $1-e^{-t}$ ,

[TABLE]

where $C_{6}=C_{6}(A_{U})>0$ . Combining (B.33), (B.37), (B.38), (B.39) and (B.40), we conclude that conditioning on some event that occurs with $\mathbb{P}^{\dagger}$ -probability at least $1-4e^{-t}$ , it holds

[TABLE]

with $\mathbb{P}^{*}$ -probability at least $1-3e^{-t}$ provided $n\geq C_{0}(d+t)^{2}$ .

Finally, combining (B.25), (B.32) and (B.41), and taking

[TABLE]

for some sufficiently large constant $C_{7}=C_{7}(A_{0},A_{U})>0$ , we conclude that, conditioning on some event that occurs with $\mathbb{P}^{\dagger}$ -probability at least $1-5e^{-t}$ ,

[TABLE]

provided $n\geq C_{0}(d+t)^{2}$ and $n\geq C_{8}\overline{\lambda}_{\bm{\Sigma}}$ , where $C_{8}=C_{8}(A_{0})>0$ . Reinterpreting this, we obtain (2.18).

Proof of (2.19). An argument similar to that given in the proof of Theorem 2.1 can be used to prove (2.19). Define the random process

[TABLE]

where $\bm{\zeta}^{\mathrm{\tiny{b}}}(\cdot)$ is given in (B.24). The stated result follows from a bound on

[TABLE]

and the facts that $\widehat{\bm{\theta}}_{\tau}\in\Theta_{0}(r_{1})$ and $\widehat{\bm{\theta}}^{\mathrm{\tiny{b}}}_{\tau}\in\Theta_{0}(r_{2})$ with high probability.

Note that $\mathbb{E}^{*}\{\bm{B}^{\mathrm{\tiny{b}}}(\bm{\theta},\bm{\theta}^{\prime})\}=\bm{B}(\bm{\theta})-\bm{B}(\bm{\theta}^{\prime})$ for $\bm{B}(\cdot)$ as in (B.10). It then follows from (B.13) that with $\mathbb{P}^{\dagger}$ -probability greater than $1-e^{-t}$ ,

[TABLE]

where $\delta(\cdot)$ is defined above (B.11) and $C_{9}=C_{9}(A_{0})>0$ .

For $\bm{B}^{\mathrm{\tiny{b}}}(\bm{\theta},\bm{\theta}^{\prime})-\mathbb{E}^{*}\{\bm{B}^{\mathrm{\tiny{b}}}(\bm{\theta},\bm{\theta}^{\prime})\}$ , note that

[TABLE]

Since we are interested in the case where both $\bm{\theta}$ and $\bm{\theta}^{\prime}$ are in a neighborhood of $\bm{\theta}^{*}$ , it suffices to focus on $\bm{\theta}$ . To proceed, we change the variable by $\bm{\delta}=\bm{\Sigma}^{1/2}(\bm{\theta}-\bm{\theta}^{*})$ and define

[TABLE]

It is easy to see that $\overline{\bm{B}}^{\mathrm{\tiny{b}}}(\textbf{0})=\textbf{0}$ , $\mathbb{E}^{*}\{\overline{\bm{B}}^{\mathrm{\tiny{b}}}(\bm{\delta})\}=\textbf{0}$ and

[TABLE]

For any $\bm{u},\bm{v}\in\mathbb{S}^{d-1}$ and $\lambda\in\mathbb{R}$ , by Condition 2 we have

[TABLE]

where $B_{U}=B_{U}(A_{U})>0$ . Applying a conditional version of Theorem A.1 in Spokoiny (2013) delivers

[TABLE]

almost surely, where $M_{n,4}$ is given in (A.4).

Finally, we take $(\bm{\theta},\bm{\theta}^{\prime})=(\widehat{\bm{\theta}}_{\tau}^{\mathrm{\tiny{b}}},\widehat{\bm{\theta}}_{\tau})$ . By (2.6) and (2.18), $\widehat{\bm{\theta}}_{\tau}\in\Theta_{0}(r_{1})$ with probability at least $1-3e^{-t}$ , and with $\mathbb{P}^{\dagger}$ -probability at least $1-5e^{-t}$ ,

[TABLE]

Since $\nabla\mathcal{L}_{\tau}(\widehat{\bm{\theta}}_{\tau})=\textbf{0}$ , it holds $\bm{\Sigma}^{-1/2}\nabla\mathcal{L}^{\mathrm{\tiny{b}}}_{\tau}(\widehat{\bm{\theta}}_{\tau})=\bm{\xi}^{\mathrm{\tiny{b}}}(\widehat{\bm{\theta}}_{\tau})$ , where $\bm{\xi}^{\mathrm{\tiny{b}}}(\cdot)$ is given in (2.20). Then, on the event $\{\widehat{\bm{\theta}}_{\tau}\in\Theta_{0}(r_{1})\}$ , it holds

[TABLE]

so that the bound in (B.39) can be applied. Moreover, by the triangle inequality, $\|\bm{\Sigma}^{1/2}(\widehat{\bm{\theta}}^{\mathrm{\tiny{b}}}_{\tau}-\widehat{\bm{\theta}}_{\tau})\|_{2}\leq r_{1}+r_{2}$ with high probability, which in turn implies that $\widehat{\bm{\theta}}^{\mathrm{\tiny{b}}}_{\tau}$ falls in the interior of $\{\bm{\theta}:\|\bm{\theta}-\widehat{\bm{\theta}}_{\tau}\|_{2}\leq R\}$ for all sufficiently large $n$ and hence $\nabla\mathcal{L}^{\mathrm{\tiny{b}}}_{\tau}(\widehat{\bm{\theta}}^{\mathrm{\tiny{b}}}_{\tau})=\textbf{0}$ . This, together with (B.42), (B.43) and the definition of $\bm{B}^{\mathrm{\tiny{b}}}(\widehat{\bm{\theta}}_{\tau}^{\mathrm{\tiny{b}}},\widehat{\bm{\theta}}_{\tau})$ , proves (2.19). ∎

B.5 Proof of Theorem 2.4

The proof is based on a similar argument to that used in the proof of Theorem 2.2. To begin with, define the bootstrap random process: for $\bm{\theta},\bm{\theta}^{\prime}\in\Theta_{0}(r)$ ,

[TABLE]

By the mean value theorem, $R^{\mathrm{\tiny{b}}}(\bm{\theta},\bm{\theta}^{\prime})=(\bm{\theta}-\bm{\theta}^{\prime})^{\intercal}\nabla_{\bm{\theta}}R^{\mathrm{\tiny{b}}}(\bm{\theta},\bm{\theta}^{\prime})|_{\bm{\theta}=\widetilde{\bm{\theta}}}$ , where $\widetilde{\bm{\theta}}$ is a convex combination of $\bm{\theta}$ and $\bm{\theta}^{\prime}$ and thus satisfies $\widetilde{\bm{\theta}}\in\Theta_{0}(r)$ . It follows that

[TABLE]

where $\bm{G}^{\mathrm{\tiny{b}}}(\bm{\theta},\bm{\theta}^{\prime}):=\bm{\Sigma}^{-1/2}\nabla_{\bm{\theta}}R^{\mathrm{\tiny{b}}}(\bm{\theta},\bm{\theta}^{\prime})=\bm{\Sigma}^{-1/2}\{\nabla\mathcal{L}^{\mathrm{\tiny{b}}}_{\tau}(\bm{\theta})-\nabla\mathcal{L}^{\mathrm{\tiny{b}}}_{\tau}(\bm{\theta}^{\prime})\}-n\bm{\Sigma}^{1/2}(\bm{\theta}-\bm{\theta}^{\prime})$ . To bound the right-hand side of (B.45), we will deal with

[TABLE]

separately, where $\bm{G}(\bm{\theta},\bm{\theta}^{\prime})=\mathbb{E}^{*}\{\bm{G}^{\mathrm{\tiny{b}}}(\bm{\theta},\bm{\theta}^{\prime})\}=\bm{\Sigma}^{-1/2}\{\nabla\mathcal{L}_{\tau}(\bm{\theta})-\nabla\mathcal{L}_{\tau}(\bm{\theta}^{\prime})\}-n\bm{\Sigma}^{1/2}(\bm{\theta}-\bm{\theta}^{\prime})$ . For the latter, we have

[TABLE]

where $\Delta(r)$ is given in (B.13). For the former term, note that

[TABLE]

Define new variables $\bm{\delta}=\bm{\Sigma}^{1/2}(\bm{\theta}-\bm{\theta}^{*})$ and $\bm{\delta}^{\prime}=\bm{\Sigma}^{1/2}(\bm{\theta}^{\prime}-\bm{\theta}^{*})$ , so that

[TABLE]

where $\overline{\bm{D}}(\bm{\delta},\bm{\delta}^{\prime})=\sum_{i=1}^{n}\{\psi_{\tau}(\varepsilon_{i}-\bm{Z}_{i}^{\intercal}\bm{\delta})-\psi_{\tau}(\varepsilon_{i}-\bm{Z}_{i}^{\intercal}\bm{\delta}^{\prime})\}U_{i}\bm{Z}_{i}$ . It is easy to see that the random process $\{\overline{\bm{D}}(\bm{\delta},\textbf{0}),\bm{\delta}\in\mathbb{B}^{d}(r)\}$ satisfies $\overline{\bm{D}}(\textbf{0},\textbf{0})=\textbf{0}$ and $\mathbb{E}^{*}\{\overline{\bm{D}}(\bm{\delta},\textbf{0})\}=\textbf{0}$ . Moreover, for any $\bm{u},\bm{v}\in\mathbb{S}^{d-1}$ and $\lambda\in\mathbb{R}$ ,

[TABLE]

It then follows from Theorem A.3 in Spokoiny (2013) that

[TABLE]

Together, the estimates (B.45)–(B.48) imply that, with $\mathbb{P}^{*}$ -probability at least $1-e^{-t}$ ,

[TABLE]

Recall the proof of Theorem 2.3 and note that

[TABLE]

This, together with (B.46)–(B.49) and the proof of Theorem 2.3, yields that, conditioning on some event that occurs with $\mathbb{P}^{\dagger}$ -probability at least $1-5e^{-t}$ ,

[TABLE]

for some $C_{10}=C_{10}(A_{0})>0$ and moreover, the following inequalities

[TABLE]

and

[TABLE]

hold with $\mathbb{P}^{*}$ -probability greater than $1-4e^{-t}$ . Recall that $\nabla\mathcal{L}_{\tau}(\widehat{\bm{\theta}}_{\tau})=\textbf{0}$ and by (2.20), $\bm{\Sigma}^{-1/2}\nabla\mathcal{L}_{\tau}^{\mathrm{\tiny{b}}}(\widehat{\bm{\theta}}_{\tau})=\bm{\xi}^{\mathrm{\tiny{b}}}(\widehat{\bm{\theta}}_{\tau})$ . It then follows that

[TABLE]

Putting $\Delta^{\mathrm{\tiny{b}}}(r)=\sup_{\bm{\theta}\in\Theta_{0}(r)}\|\bm{\xi}^{\mathrm{\tiny{b}}}(\bm{\theta})-\bm{\xi}^{\mathrm{\tiny{b}}}(\bm{\theta}^{*})\|_{2}$ and combining the last three displays, we conclude that

[TABLE]

for some $C_{11}=C_{11}(A_{0},A_{U})>0$ . This proves (2.21) immediately.

For the square-root Wilks expansion, note that

[TABLE]

where $R^{\mathrm{\tiny{b}}}(\cdot,\cdot)$ is given in (B.44). For any $r>0$ , similarly to (B.45), it holds

[TABLE]

Again, the estimates (B.46)–(B.48) imply that, with $\mathbb{P}^{*}$ -probability at least $1-e^{-t}$ ,

[TABLE]

where $\Delta(r)$ is given in (B.13). Recall that $\bm{G}^{\mathrm{\tiny{b}}}(\widehat{\bm{\theta}}_{\tau},\widehat{\bm{\theta}}_{\tau}^{\mathrm{\tiny{b}}})=\bm{\Sigma}^{-1/2}\nabla\mathcal{L}^{\mathrm{\tiny{b}}}_{\tau}(\widehat{\bm{\theta}}_{\tau})-n\bm{\Sigma}^{1/2}(\widehat{\bm{\theta}}_{\tau}-\widehat{\bm{\theta}}_{\tau}^{\mathrm{\tiny{b}}})=\bm{\xi}^{\mathrm{\tiny{b}}}(\widehat{\bm{\theta}}_{\tau})-n\bm{\Sigma}^{1/2}(\widehat{\bm{\theta}}_{\tau}-\widehat{\bm{\theta}}_{\tau}^{\mathrm{\tiny{b}}})$ . Following the same argument that delivers (B.50), we reach

[TABLE]

where $C_{12}=C_{12}(A_{0},A_{U})>0$ . This is the bound stated in (2.22). ∎

B.6 Proof of Theorem 2.5

We divide the proof into three steps. In the first step, we revisit the non-asymptotic square-root Wilks approximations for the excess loss and its bootstrap counterpart. The second step is on Gaussian approximation for the $\ell_{2}$ -norm of the standardized score vector $\bm{\Sigma}^{-1/2}\nabla\mathcal{L}_{\tau}(\bm{\theta}^{*})$ . The last step links the distributions of the excess loss and its bootstrap counterpart via a Gaussian comparison inequality. Without loss of generality, we assume $t\geq 1$ throughout the proof.

Step 1 (Wilks approximations). Define $\bm{\xi}^{*}\!=\!-\sum_{i=1}^{n}\xi_{i}\bm{Z}_{i}$ and recall that $\bm{\xi}^{\mathrm{\tiny{b}}}=-\sum_{i=1}^{n}\xi_{i}U_{i}\bm{Z}_{i}$ .

For any $x\geq 0$ , it follows from (2.9) that

[TABLE]

where $R_{1}>0$ satisfies $R_{1}\asymp v(d+t)n^{-1/2}$ . Similarly, applying (2.22) yields that, with probability (over $\mathcal{D}_{n}$ ) at least $1-5e^{-t}$ ,

[TABLE]

where $R_{2}>0$ satisfies $R_{2}\asymp v(d+t)n^{-1/2}$ . In the following two steps, we validate the approximation of the distribution of $\|\bm{\xi}^{\mathrm{\tiny{b}}}\|_{2}$ by that of $\|\bm{\xi}^{*}\|_{2}$ in the Kolmogorov distance. To that end, define random vectors

[TABLE]

In this notation, we have $\|\bm{\xi}^{*}\|_{2}=\sqrt{n}\|\mathbf{S}_{1}\|_{2}$ and $\|\bm{\xi}^{\mathrm{\tiny{b}}}\|_{2}=\sqrt{n}\|\mathbf{S}_{2}\|_{2}$ .

Step 2 (Gaussian approximation for $\|\mathbf{S}_{1}\|_{2}$ ). Recall the truncated mean and second moment $m_{\tau}=\mathbb{E}(\xi_{i})$ and $\sigma^{2}_{\tau}=\mathbb{E}(\xi_{i}^{2})$ , and consider the centered sum

[TABLE]

where $\bm{\upsilon}=\mathbb{E}(\bm{Z})$ is such that $\|\bm{\upsilon}\|_{2}\leq 1$ . Here, $\bm{V}_{1},\ldots,\bm{V}_{n}$ are independent copies of the random vector $\bm{V}=\psi_{\tau}(\varepsilon)\bm{Z}\in\mathbb{R}^{d}$ with mean $m_{\tau}\bm{\upsilon}$ and covariance matrix $\bm{\Sigma}_{1}=\sigma^{2}_{\tau}\,\bm{I}_{d}-m_{\tau}^{2}\,\bm{\upsilon}\bm{\upsilon}^{\intercal}$ . For $(m_{\tau},\sigma_{\tau}^{2})$ , applying Proposition A.2 with $\kappa=4$ in the supplement of Zhou et al. (2018) gives

[TABLE]

Hence, for any $\bm{u}\in\mathbb{S}^{d-1}$ , it holds

[TABLE]

Taking $\tau=v\{n/(d+t)\}^{\eta}$ for $v\geq\upsilon_{4}^{1/4}$ , this implies $\overline{\lambda}_{\bm{\Sigma}_{1}}\leq\sigma^{2}$ and

[TABLE]

provided $n\geq(4\upsilon_{4}^{1/2}/\sigma^{2})^{1/(2\eta)}(d+t)$ . Also, under this scaling condition, it holds $\sigma_{\tau}^{2}\geq 3\sigma^{2}/4$ . It then follows from a multivariate central limit theorem (Bentkus, 2005) that

[TABLE]

where $\bm{G}_{1}\sim\mathcal{N}(\textbf{0},\bm{\Sigma}_{1})$ and $C_{1},C_{2}>0$ are absolute constants.

Let $\bm{G}_{0}\sim\mathcal{N}(\textbf{0},\bm{\Sigma}_{0})$ with $\bm{\Sigma}_{0}:=\sigma_{\tau}^{2}\,\bm{I}_{d}$ . Note that $\bm{\Sigma}_{0}^{-1/2}\bm{\Sigma}_{1}\bm{\Sigma}_{0}^{-1/2}\!-\!\bm{I}_{d}\!=\!m_{\tau}^{2}\sigma^{-2}_{\tau}\bm{\upsilon}\bm{\upsilon}^{\intercal}$ ,

[TABLE]

Applying Lemma A.7 in the supplementary material of Spokoiny and Zhilova (2015) gives

[TABLE]

provided $\delta_{\tau}\leq 1/2$ . In addition, the Gaussian random vector $\bm{G}_{0}$ satisfies the following anti-concentration inequality (Ball, 1993): for any $\epsilon\geq 0$ ,

[TABLE]

where $C_{3}>0$ is an absolute constant.

For the deterministic term $\mathbb{E}(\mathbf{S}_{1})=\sqrt{n}\,m_{\tau}\bm{\upsilon}$ , we have

[TABLE]

Combining this with (B.53)–(B.55), we arrive at

[TABLE]

Step 3 (Gaussian comparison). Note that, conditional on $\mathcal{D}_{n}$ , $\mathbf{S}_{2}$ follows a multivariate normal distribution with mean $\mathbb{E}(\mathbf{S}_{2}|\mathcal{D}_{n})=\textbf{0}$ and covariance matrix

[TABLE]
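The conditional Gaussianity used in Step 3 can be checked numerically. The sketch below assumes (these are assumptions for illustration, not statements from the proof) that the multipliers $U_{i}$ are IID standard normal, that $\xi_{i}=\psi_{\tau}(\varepsilon_{i})$, and that $\mathbf{S}_{2}=n^{-1/2}\sum_{i=1}^{n}\xi_{i}U_{i}\bm{Z}_{i}$, so that the conditional covariance is $n^{-1}\sum_{i=1}^{n}\xi_{i}^{2}\bm{Z}_{i}\bm{Z}_{i}^{\intercal}$. The sample size, dimension, noise law and $\tau$ are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, tau = 50, 3, 2.0                      # illustrative values only

# One fixed data set D_n (held fixed while the multipliers are redrawn).
Z = rng.standard_normal((n, d))             # standardized covariates (assumed)
eps = rng.standard_t(df=3, size=n)          # heavy-tailed noise (assumed)
xi = np.clip(eps, -tau, tau)                # xi_i = psi_tau(eps_i)

A = xi[:, None] * Z                         # row i is xi_i * Z_i
exact_cov = A.T @ A / n                     # (1/n) sum_i xi_i^2 Z_i Z_i^T

# Monte Carlo over the multipliers only, conditional on the data.
B = 20000
U = rng.standard_normal((B, n))             # assumed N(0, 1) multipliers
S2 = U @ A / np.sqrt(n)                     # each row is one draw of S_2
emp_mean = S2.mean(axis=0)
emp_cov = S2.T @ S2 / B
```

Because each draw of $\mathbf{S}_{2}$ is a fixed linear map of a Gaussian vector, the empirical mean and covariance should agree with zero and `exact_cov` up to Monte Carlo error.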

Applying Lemma A.2 with $x=2t$ yields that, with probability at least $1-e^{-t}$ ,

[TABLE]

as long as $n$ is sufficiently large, where $C_{4}=C_{4}(A_{0})>0$ . Hence, it follows from a conditional version of Lemma A.7 in the supplement of Spokoiny and Zhilova (2015) that, with probability (over $\mathcal{D}_{n}$ ) at least $1-e^{-t}$ ,

[TABLE]

In particular, taking $y=\max(x-R_{2},0)$ gives

[TABLE]

Combining the inequalities (B.51), (B.52), (B.56) and (B.57), we conclude that with probability (over $\mathcal{D}_{n}$ ) at least $1-6e^{-t}$ ,

[TABLE]

A similar argument leads to the reverse inequality and thus completes the proof by taking $z=x^{2}/2$ . ∎

B.7 Proof of Theorem 2.6

For $\alpha\in(0,1)$ , let $q^{\flat}_{\alpha}$ and $q_{\alpha}$ be the upper $\alpha$ -quantiles of

[TABLE]

respectively, under $\mathbb{P}^{*}$ and $\mathbb{P}$ . By the definitions of $z^{\flat}_{\alpha}$ and $z_{\alpha}$ in (2.25) and (2.17), it is easy to see that $z_{\alpha}^{\flat}=(q_{\alpha}^{\flat})^{2}/2$ almost surely and $z_{\alpha}=q_{\alpha}^{2}/2$ . According to Theorem 2.5, there exists an event $\mathcal{E}_{t}$ satisfying $\mathbb{P}(\mathcal{E}_{t})\geq 1-6e^{-t}$ such that

[TABLE]

and

[TABLE]

hold almost surely on $\mathcal{E}_{t}$ , where $\Delta_{1}=\Delta_{1}(n,d,t)$ . Together, these inequalities imply

[TABLE]

Next, define the Lévy concentration function of the non-negative random variable $T:=\sqrt{2\{\mathcal{L}_{\tau}(\bm{\theta}^{*})-\mathcal{L}_{\tau}(\widehat{\bm{\theta}}_{\tau})\}}$ :

[TABLE]

It then follows from (B.58) that

[TABLE]

Similarly, using (B.9) and the definition of $L(\cdot)$ , we get

[TABLE]

To complete the proof, it remains to bound $L(\epsilon)$ for any given $\epsilon>0$ . Keeping the notations used in the proof of Theorem 2.5, and following (B.51), (B.53) and (B.55), we obtain that for any $x\geq 0$ ,

[TABLE]

where $R_{1}\asymp v(d+t)n^{-1/2}$ is as in (B.51), $\overline{\bm{G}}_{1}\sim\mathcal{N}(\mathbb{E}(\mathbf{S}_{1}),\bm{\Sigma}_{1})$ and $C>0$ is an absolute constant.

Finally, combining (B.59), (B.60) and (B.61), we reach (2.26). ∎

Appendix C Proofs for Sections 3 and 4

C.1 Proof of Theorem 3.1

This proof is based on an argument similar to that used in the proof of Theorem 5.1 in Minsker (2018). Let $j^{*}=\min\{j\in\mathcal{J}:v_{j}\geq\sigma\}$ and note that $v_{j^{*}}\leq a\sigma$ . From the definition of $\widehat{j}_{{\rm L}}$ in (3.1) with $c_{0}\geq 2c_{1}\underline{\lambda}_{\bm{\Sigma}}^{-1/2}$ , we see that

[TABLE]

Define the event

[TABLE]

such that $\mathcal{B}\subseteq\{\widehat{j}_{{\rm L}}\leq j^{*}\}$ . Recalling Theorem 2.1, we have for any $v\geq\sigma$ , $\widehat{\bm{\theta}}_{\tau}$ with $\tau=v\sqrt{n/(d+t)}$ satisfies the bound

[TABLE]

with probability at least $1-3e^{-t}$ as long as $n\gtrsim d+t$ . Together with the union bound, this implies

[TABLE]

On the event $\mathcal{B}$ , $\widehat{j}_{{\rm L}}\leq j^{*}$ and thus

[TABLE]

Together, the last two displays lead to the stated result. ∎

C.2 Proof of Theorem 3.2

To begin with, define $\mathcal{D}_{n}^{(1)}$ and $\mathcal{D}_{n}^{(2)}$ to be the two independent samples $\{(Y_{i}^{(1)},\bm{X}_{i}^{(1)})\}_{i=1}^{n}$ and $\{(Y_{i}^{(2)},\bm{X}_{i}^{(2)})\}_{i=1}^{n}$ , respectively, such that $\bar{\mathcal{D}}_{n}=\mathcal{D}_{n}^{(1)}\cup\mathcal{D}_{n}^{(2)}$ . Under the assumption that $\mathbb{E}(|\varepsilon|^{4+\delta})\leq\upsilon_{4+\delta}$ for some $\delta>0$ , we have $\mathbb{E}|Y-\mu_{Y}|^{4+\delta}<\infty$ . For each $j=1,\ldots,m$ with $m$ denoting the number of blocks, by Chebyshev’s inequality, one can show that for any $\delta\in(0,1/2]$ ,

[TABLE]

with probability at least $1-\delta$ . Then, it follows from a variant of Lemma 2 in Bubeck, Cesa-Bianchi and Lugosi (2013) that, with $m=\lfloor 8\log n+1\rfloor$ ,

[TABLE]

as long as $n\gtrsim\log n$ , where the probability is over the training set $\mathcal{D}_{n}^{(1)}$ . Therefore, with the same probability (over $\mathcal{D}_{n}^{(1)}$ ), $|\widehat{\upsilon}_{Y,{\rm mom}}-\upsilon_{Y}|\leq\upsilon_{Y}/2$ for all sufficiently large $n$ . Then, it follows that, with high probability over $\mathcal{D}_{n}^{(1)}$ ,

[TABLE]

where the last inequality holds provided $K\geq\lfloor\log_{a}(3\upsilon_{Y}/\upsilon_{4})^{1/4}\rfloor+1$ . This, together with Theorem 3.1 with slight modifications, implies that with probability (over $\mathcal{D}_{n}^{(1)}$ ) at least $1-O(Kn^{-1})$ ,

[TABLE]

where $\tau^{*}:=\upsilon_{4}^{1/4}(\frac{n}{d+\log n})^{1/4}$ .
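The median-of-means step in the first part of this proof can be illustrated with a generic sketch. It is an assumption-laden illustration: the function name `median_of_means` is hypothetical, the blocks are formed by splitting the sample in its given order, and the quantity being estimated here is a plain mean rather than the specific moment $\upsilon_{Y}$ used in the paper. Only the block count $m=\lfloor 8\log n+1\rfloor$ is taken from the text.

```python
import numpy as np

def median_of_means(x, m):
    """Split x into m nearly equal blocks and return the median of block means."""
    blocks = np.array_split(np.asarray(x, dtype=float), m)
    return float(np.median([b.mean() for b in blocks]))

rng = np.random.default_rng(0)
n = 2000
m = int(np.floor(8 * np.log(n) + 1))        # number of blocks, as in the proof

# Heavy-tailed sample with true mean 5 (illustrative choice).
sample = 5.0 + rng.standard_t(df=3, size=n)
mom_estimate = median_of_means(sample, m)
```

The median across blocks limits the influence of the few blocks that contain extreme observations, which is why the estimator concentrates under weak moment conditions.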

For the second step, write $\varepsilon_{i}^{(2)}=Y_{i}^{(2)}-\langle\bm{X}_{i}^{(2)},\bm{\theta}^{*}\rangle$ for $i=1,\ldots,n$ . Define random vectors

[TABLE]

Conditioning on the event that (C.1) holds, applying Theorems 2.2 and 2.4 we obtain that as long as $n\gtrsim d+\log n$ ,

[TABLE]

with probability (over $\mathcal{D}_{n}^{(2)}$ ) at least $1-O(n^{-1})$ , and

[TABLE]

with probability (over $\mathcal{D}_{n}^{(2)}$ and $\{W_{i}\}_{i=1}^{n}$ ) at least $1-O(n^{-1})$ . With the above preparations, the stated result follows from the same argument as in the proof of Theorem 2.6. ∎

C.3 Proof of Theorem 4.1

The proof consists of two main steps.

Step 1 (Accuracy of bootstrap approximations). For each $1\leq k\leq m$ , write $\mathcal{D}_{kn}=\{(y_{ik},\bm{x}_{i})\}_{i=1}^{n}$ and $T_{k}^{\mathrm{\tiny{b}}}=\sum_{i=1}^{n}\psi_{\tau_{k}}(\varepsilon_{ik})U_{i}$ . Then, applying Theorem 2.3 with $\bm{X}=(1,\bm{x}^{\intercal})^{\intercal}$ and $\bm{\theta}^{*}=(\mu_{k},\bm{\beta}_{k}^{\intercal})^{\intercal}$ gives that, with probability (over $\mathcal{D}_{kn}$ ) at least $1-6/(nm)^{2}$ ,

[TABLE]

where $\delta_{2k}:=C_{1k}\,v_{k}\{s+\log(nm)\}n^{-1/2}$ and $C_{1k}=C_{1k}(A_{0},A_{U})>0$ . Observe that, conditional on $\mathcal{D}_{kn}$ , $n^{-1/2}T^{\mathrm{\tiny{b}}}_{k}$ follows a normal distribution with mean zero and variance $\widehat{\sigma}_{k,\tau_{k}}^{2}=(1/n)\sum_{i=1}^{n}\{\ell_{\tau_{k}}^{\prime}(\varepsilon_{ik})\}^{2}$ . With $\tau_{k}=v_{k}[n/\{s+2\log(nm)\}]^{1/3}$ , an argument similar to that used to derive Lemma A.2 may be employed to show that, with probability at least $1-(nm)^{-2}$ ,

[TABLE]

where $C_{2k}=C_{2k}(A_{0})>0$ . Combining (C.2), (C.3), Lemma A.7 in Spokoiny and Zhilova (2015) and the union bound, we conclude that with probability (over $\mathcal{D}_{kn}$ ) at least $1-7/(n^{2}m)$ ,

[TABLE]

for all $k=1,\ldots,m$ . Combining this with (A.20), (A.21) and taking $a_{n}=2\log(nm)$ in Lemma A.5, we conclude that on some event that occurs with probability at least $1-7/(n^{2}m)$ ,

[TABLE]

uniformly in $0\leq z/\sigma_{k}\leq o\{\min(n^{1/6},\sqrt{n}/\log m)\}$ and $1\leq k\leq m$ .

Step 2 (FDP control with bootstrap calibration). For $k=1,\ldots,m$ and $z\geq 0$ , define $\widehat{T}_{k}=\sqrt{n}\,\widehat{\mu}_{k}$ , $G(z)=2\{1-\Phi(z)\}$ ,

[TABLE]

In this notation, we have $p^{\mathrm{\tiny{b}}}_{k}=G^{\mathrm{\tiny{b}}}_{k}(|\widehat{T}_{k}|)$ for $k=1,\ldots,m$ . As a direct consequence of Lemma 1 in Storey, Taylor and Siegmund (2004), the BH procedure with $p$ -values $\{p^{\mathrm{\tiny{b}}}_{k}\}_{k=1}^{m}$ is equivalent to Storey’s procedure, that is, reject $H_{0k}$ if and only if $p^{\mathrm{\tiny{b}}}_{k}\leq t^{\mathrm{\tiny{b}}}_{{\rm S}}$ , where

[TABLE]
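The equivalence between the BH step-up rule and a single data-driven threshold can be checked numerically. The sketch below assumes the threshold has the standard form $\sup\{t\in[0,1]:t\leq\alpha\max(\sum_{k}I(p_{k}\leq t),1)/m\}$; this is an assumption about the display above, chosen to be consistent with the later remark that the threshold lies in $[\alpha/m,1]$. The function names and the simulated test statistics are hypothetical.

```python
import numpy as np
from math import erfc, sqrt

def bh_reject(p, alpha):
    """Benjamini-Hochberg step-up: reject the k* smallest p-values."""
    m = p.size
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k_star = np.max(np.nonzero(below)[0]) + 1
        reject[order[:k_star]] = True
    return reject

def storey_threshold(p, alpha):
    """sup{t in [0,1] : t <= alpha * max(#{p_k <= t}, 1) / m}.

    The supremum is attained on the grid t = alpha * j / m, j = 1, ..., m.
    """
    m = p.size
    j = np.arange(1, m + 1)
    cands = alpha * j / m
    counts = np.searchsorted(np.sort(p), cands, side="right")  # #{p_k <= t}
    feasible = np.maximum(counts, 1) >= j
    return float(cands[feasible].max())

rng = np.random.default_rng(0)
m, m1, alpha = 1000, 50, 0.1
T = rng.standard_normal(m)
T[:m1] += 4.0                                   # a few non-null statistics
# Two-sided normal p-values: G(z) = 2{1 - Phi(z)} = erfc(z / sqrt(2)).
pvals = np.array([erfc(abs(z) / sqrt(2.0)) for z in T])

rej_bh = bh_reject(pvals, alpha)
t_s = storey_threshold(pvals, alpha)
rej_thr = pvals <= t_s
```

Both rules reject exactly the same hypotheses, which is what allows the FDP analysis to work with a single threshold rather than the ordered $p$-values.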

By the definition of $t^{\mathrm{\tiny{b}}}_{{\rm S}}$ , we have

[TABLE]

For the bootstrap $p$ -values $p^{\mathrm{\tiny{b}}}_{k}$ and data-driven threshold $t^{\mathrm{\tiny{b}}}_{{\rm S}}$ , we claim that, as $(n,m)\to\infty$ ,

[TABLE]

for any sequence $b_{m}>0$ satisfying $b_{m}\to\infty$ and $b_{m}=o(m)$ , where $m_{1,\lambda_{0}}={\rm card}\{1\leq k\leq m:|\mu_{k}|/\sigma_{k}\geq\lambda_{0}\sqrt{(2\log m)/n}\,\}$ . Under condition (4.5), it follows

[TABLE]

which, together with (C.5), proves the stated result (4.6).

It remains to verify (C.6) and (C.7). By (C.5), it is clear that $t^{\mathrm{\tiny{b}}}_{{\rm S}}\in[\alpha/m,1]$ . Recall that $\log m=o(n^{1/3})$ . Then, by (C.4),

[TABLE]

uniformly in $1\leq k\leq m$ as $(n,m)\to\infty$ . Note that

[TABLE]

Combining the last two displays, we see that with probability tending to $1$ , $t^{\mathrm{\tiny{b}}}_{{\rm S}}\geq G^{\mathrm{\tiny{b}}}_{k}(\sigma_{k}\sqrt{2\log m})$ for all $1\leq k\leq m$ . It follows

[TABLE]

Furthermore,

[TABLE]

For any $\epsilon>0$ , define the event

[TABLE]

on which it holds

[TABLE]

Using Lemma A.5 and the union bound shows that, as $(n,m)\to\infty$ ,

[TABLE]

Putting the above calculations together leads to the claim (C.6).

Finally we verify (C.7). By Lemma A.5 and (C.4), it is easy to see that

[TABLE]

as $(n,m)\to\infty$ . Also, consider the event

[TABLE]

Similarly to (C.8), it can be shown that $\mathbb{P}(\mathcal{A}_{0}^{{\rm c}})\to 0$ . Consequently, there exists a sequence $\{\alpha_{n}\}_{n\geq 1}$ of positive numbers satisfying $\alpha_{n}\to 0$ such that

[TABLE]

Again, using Lemma A.5 gives

[TABLE]

Note that, with $0\leq z\leq\sigma_{k}\sqrt{2\log m}$ , it holds

[TABLE]

In (C.10), we change the variable by $t=G(z/\sigma_{k})$ to obtain

[TABLE]

By an argument similar to that in the proof of Proposition B.3 in Zhou et al. (2018), it follows that for any sequence $b_{m}>0$ satisfying $b_{m}\to\infty$ and $b_{m}=o(m)$ ,

[TABLE]

Together with (C.9), this proves (C.7) as desired. ∎

Appendix D Implementation

Since the bootstrap Huber estimator needs to be computed many times, an efficient optimization solver is critical for applications. Ideally, second-order methods such as Newton’s method should be adopted because of their fast convergence. Denote the gradient of the weighted Huber loss in (2.16) by

[TABLE]

Although $\bm{g}(\bm{\theta})$ is not differentiable everywhere with respect to $\bm{\theta}$ , we can still compute a generalized Jacobian of $\bm{g}(\bm{\theta})$ :

[TABLE]

which serves as an “approximate Hessian matrix”. Given (D.2), the generalized Newton method can be directly implemented via the following iterative procedure (for $t=1,2,\ldots$ ):

[TABLE]

where $\eta_{t}$ is the step-size. We note that the constraint in (2.16) is omitted here, since it is introduced mainly for theoretical analysis and will not affect the empirical performance.

Although (D.3) is easy to implement, there remains a practical issue that the Hessian matrix $\bm{H}(\bm{\theta}^{t})$ is not always invertible. To address this issue, we adopt the damped semismooth Newton method, which is a combination of Newton’s method and gradient descent. The idea is straightforward: when $\bm{H}(\bm{\theta}^{t})$ is invertible, $\bm{\theta}^{t+1}$ is computed via the generalized Newton step in (D.3); otherwise, the gradient descent step is performed, that is,

[TABLE]

The step-size $\eta_{t}$ is determined via the backtracking-Armijo line search rule.
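The procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the weighted loss is taken to be $\sum_{i}W_{i}\ell_{\tau}(Y_{i}-\bm{X}_{i}^{\intercal}\bm{\theta})$ without normalization, the function names are hypothetical, the Newton direction is additionally required to be a descent direction (an extra safeguard not stated in the text), and the demo uses positive unit-mean exponential weights so that the objective is convex.

```python
import numpy as np

def huber_loss(r, tau):
    a = np.abs(r)
    return np.where(a <= tau, 0.5 * r**2, tau * a - 0.5 * tau**2)

def objective(theta, X, y, w, tau):
    return float(np.sum(w * huber_loss(y - X @ theta, tau)))

def gradient(theta, X, y, w, tau):
    # g(theta) = -sum_i W_i * psi_tau(r_i) * X_i, with psi_tau the clipped residual
    r = y - X @ theta
    return -X.T @ (w * np.clip(r, -tau, tau))

def gen_jacobian(theta, X, y, w, tau):
    # H(theta) = sum_i W_i * I(|r_i| <= tau) * X_i X_i^T
    active = (np.abs(y - X @ theta) <= tau).astype(float)
    return (X * (w * active)[:, None]).T @ X

def damped_semismooth_newton(X, y, w, tau, max_iter=100, tol=1e-8):
    theta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        g = gradient(theta, X, y, w, tau)
        if np.linalg.norm(g) <= tol:
            break
        H = gen_jacobian(theta, X, y, w, tau)
        direction = -g                                  # gradient-descent fallback
        if np.linalg.matrix_rank(H) == H.shape[0]:
            newton = -np.linalg.solve(H, g)
            if g @ newton < 0:                          # extra descent safeguard
                direction = newton
        # Backtracking-Armijo line search on the step size eta.
        f0, slope, eta = objective(theta, X, y, w, tau), g @ direction, 1.0
        while (objective(theta + eta * direction, X, y, w, tau)
               > f0 + 1e-4 * eta * slope) and eta > 1e-12:
            eta *= 0.5
        theta = theta + eta * direction
    return theta

rng = np.random.default_rng(0)
n, d, tau = 200, 3, 1.5                                 # illustrative values
X = rng.standard_normal((n, d))
theta_star = np.array([1.0, -1.0, 0.5])
y = X @ theta_star + rng.standard_t(df=3, size=n)
w = rng.exponential(1.0, size=n)                        # positive weights (assumed)
theta_hat = damped_semismooth_newton(X, y, w, tau)
```

With weights that can be negative, such as $\mathcal{N}(1,1)$, the same code runs but the fallback branch may be triggered more often and only convergence to a stationary point should be expected.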

Now we briefly discuss the convergence of the damped semismooth Newton method. Because the random weights $W_{i}$ may sometimes take negative values, our objective function could be non-convex, and thus we only discuss convergence to a stationary point, i.e. some $\widehat{\bm{\theta}}$ such that $g(\widehat{\bm{\theta}})=0$ . The following proposition from Qi and Sun (1999) and De Luca, Facchinei and Kanzow (1996) provides the local convergence rate for solving a system $g(\bm{\theta})=0$ .

Proposition D.1.

Suppose that $g(\widehat{\bm{\theta}})=0$ , where $g$ is locally Lipschitz, and that all $V\in\partial g(\widehat{\bm{\theta}})$ are non-singular. If $g$ is strongly semismooth at $\widehat{\bm{\theta}}$ , then the method is quadratically convergent in a neighborhood of $\widehat{\bm{\theta}}$ .

Now, let us verify the conditions in Proposition D.1 for the weighted Huber regression. Given the Huber loss $\ell_{\tau}(x)$ and its derivative $\ell_{\tau}^{\prime}(x)=xI(|x|\leq\tau)+\tau\cdot{\rm sign}(x)I(|x|>\tau)$ , Clarke’s generalized Jacobian of $\ell_{\tau}^{\prime}(x)$ (Hiriart-Urruty, 2001) can be calculated as

[TABLE]
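For reference, a standard computation for the clipped-linear function $\ell_{\tau}^{\prime}$ defined above gives the set-valued map below. It is derived here from the definition of $\ell_{\tau}^{\prime}$ as a sketch and may differ in presentation from the display (D.5).

```latex
\partial\ell_{\tau}^{\prime}(x)=
\begin{cases}
\{1\}, & |x|<\tau,\\
[0,1], & |x|=\tau,\\
\{0\}, & |x|>\tau.
\end{cases}
```

The only points of non-differentiability are $x=\pm\tau$, where the generalized Jacobian is the convex hull of the two one-sided slopes $0$ and $1$.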

The boundedness of $\ell_{\tau}^{\prime}(x)$ implies that $g$ in (D.1) is locally Lipschitz. Moreover, we can easily verify that $\ell_{\tau}^{\prime}(x)$ is a strongly semismooth function. Since semismoothness is preserved under linear transformations, the function $g$ in (D.1) is also strongly semismooth. The remaining condition is the non-singularity of every $V\in\partial g(\widehat{\bm{\theta}})$ , where $\partial g$ denotes Clarke’s generalized Jacobian of $g$ . According to (D.5), we have

[TABLE]

We note that $\bm{H}$ in (D.2) is also a member of $\partial g(\widehat{\bm{\theta}})$ . The non-singularity condition depends on the realization of the random weights $W_{i}$ and on $\widehat{\bm{\theta}}$ . However, since the dimension $d$ is small compared to $n$ , and $W_{i}$ and $(Y_{i},\bm{X}_{i})$ are IID random variables, the non-singularity condition is easily satisfied as long as $\widehat{\bm{\theta}}$ is not too extreme (e.g. there are at least $d$ terms such that $|Y_{i}-\bm{X}_{i}^{\intercal}\widehat{\bm{\theta}}|<\tau$ ).

Appendix E Selecting robustification parameter: A data-driven approach

E.1 Preliminaries

Let $X$ be a real-valued random variable with finite variance. For $z\geq 0$ , define

[TABLE]

where $\psi_{z}(x)=(|x|\wedge z)\mathop{\mathrm{sign}}(x)$ , $x\in\mathbb{R}$ . Moreover, for $z>0$ , we define

[TABLE]

It is easy to see that $Q(z)=P(z)+z^{2}G(z)$ and $q(z)=p(z)+G(z)$ . The following result provides some useful connections among these functions. See (2.3) and (2.4) in Hahn, Kuelbs and Weiner (1990). We reproduce them here for the sake of readability.

Lemma E.1.

Let functions $G,Q,p$ and $q$ be given in (E.1) and (E.2).

(i)

The function $Q:[0,\infty)\to\mathbb{R}$ is non-decreasing with $\lim_{z\to\infty}Q(z)=\mathbb{E}(X^{2})$ . For any $z>0$ , we have

[TABLE]

and

[TABLE]

(ii)

The function $q:(0,\infty)\to\mathbb{R}$ is non-increasing and positive everywhere with $q(0+):=\lim_{s\downarrow 0}q(s)=\mathbb{P}(X\neq 0)$ . Moreover,

[TABLE]

for all $0\leq s\leq\Delta:=\inf\{y>0:G(y)<\mathbb{P}(X\neq 0)\}$ , and $q(s)$ decreases strictly and continuously on $(\Delta,\infty)$ with $\lim_{z\to\infty}q(z)=0$ .

Proof of Lemma E.1.

Note that

[TABLE]

Taking expectations on both sides implies $Q(z)=\mathbb{E}(|X|\wedge z)^{2}=2\int_{0}^{z}\mathbb{P}(|X|>y)y\,dy=2\int_{0}^{z}yG(y)dy$ , as stated. It follows that $Q^{\prime}(z)=2zG(z)$ and thus $Q$ is non-decreasing. Moreover, by the monotone convergence theorem we see that $\lim_{z\to\infty}Q(z)=\mathbb{E}(X^{2})$ .
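The identity $Q(z)=2\int_{0}^{z}yG(y)\,dy$ can be checked numerically on an empirical distribution. The sketch below assumes, consistently with the relations stated in Section E.1, that $G(z)=\mathbb{P}(|X|>z)$, $P(z)=\mathbb{E}\{X^{2}I(|X|\leq z)\}$ and $Q(z)=\mathbb{E}(X^{2}\wedge z^{2})$, with expectations replaced by sample averages. The function names are hypothetical.

```python
import numpy as np

def G_emp(x, y):
    """Empirical tail G_n(y) = fraction of |x_i| strictly greater than y."""
    abs_sorted = np.sort(np.abs(x))
    return 1.0 - np.searchsorted(abs_sorted, y, side="right") / x.size

def Q_direct(x, z):
    """Q_n(z) = mean of min(|x_i|, z)^2."""
    return float(np.mean(np.minimum(np.abs(x), z) ** 2))

def Q_integral(x, z, grid=20001):
    """2 * int_0^z y G_n(y) dy, approximated by the trapezoid rule."""
    y = np.linspace(0.0, z, grid)
    f = 2.0 * y * G_emp(x, y)
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(y)))

def P_direct(x, z):
    """P_n(z) = mean of x_i^2 * I(|x_i| <= z)."""
    return float(np.mean(x**2 * (np.abs(x) <= z)))

rng = np.random.default_rng(0)
x = rng.standard_t(df=3, size=500)      # illustrative heavy-tailed sample
z = 2.0
q_direct = Q_direct(x, z)
q_integral = Q_integral(x, z)
```

The decomposition $Q(z)=P(z)+z^{2}G(z)$ holds exactly for the empirical versions, while the integral form agrees up to quadrature error.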

Next, taking derivatives with respect to $z$ on both sides of (E.2) gives $2zq(z)+z^{2}q^{\prime}(z)=2zG(z)=2z\{q(z)-p(z)\}$ , which proves the second equation in (E.3). To prove (E.4), note that, for any $0<s<z$ , $q(z)=q(s)-2\int_{s}^{z}y^{-1}p(y)\,dy$ . On the event $\{X\neq 0\}$ , it holds almost surely that

[TABLE]

By the dominated convergence theorem,

[TABLE]

as $s\to 0$ . In the equation $q(z)=q(s)-2\int_{s}^{z}y^{-1}p(y)\,dy$ for $0<s<z$ , letting $s$ tend to zero proves (E.4).

Turning to part (ii), by the definition of $\Delta$ , we have $\mathbb{P}(0<|X|\leq y)=0$ and thus $p(y)=0$ for all $0<y<\Delta$ . This, together with (E.4), implies $q(s)=\mathbb{P}(X\neq 0)>0$ for all $0\leq s\leq\Delta$ . It is easy to see that $p(y)>0$ for any $y>\Delta$ , and therefore $q(\cdot)$ is strictly decreasing on $(\Delta,\infty)$ . Finally, note that

[TABLE]

By the dominated convergence theorem, $\lim_{z\to\infty}q(z)=0$ as desired. ∎

E.2 Catoni’s lower bound of sample mean

Let $X_{1},\ldots,X_{n}$ be IID random variables from $X$ with mean zero and variance $\sigma^{2}>0$ . Let $\mathcal{A}_{\sigma^{2}}$ be the set of probability measures on the real line with variance bounded by $\sigma^{2}$ . Catoni (2012) proved a lower bound for the deviations of the empirical mean $\bar{X}_{n}$ when the underlying distribution is the least favorable in $\mathcal{A}_{\sigma^{2}}$ : for any $t\geq 2e$ , there exists some distribution with mean zero and variance $\sigma^{2}$ such that the IID sample of size $n$ drawn from it satisfies

[TABLE]

with probability at least $2t^{-1}$ . This shows that the worst case deviations of $\bar{X}_{n}$ are suboptimal with heavy-tailed data.

E.3 Proof of Proposition 3.1

For any $\tau>0$ , note that $\psi_{\tau}(X_{i})$ ’s are independent random variables satisfying $|\psi_{\tau}(X_{i})|\leq\tau$ and $\mathbb{E}\psi_{\tau}^{2}(X_{i})=\sigma_{\tau}^{2}$ . By Bernstein’s inequality,

[TABLE]

with probability at least $1-2e^{-t}$ . Taking $\tau=\tau_{t}$ in the last display leads to the first inequality in (3.9), which, together with (3.6), proves the second one.

To prove (3.10), we first make a finite approximation of the interval $[1/2,3/2]$ using a sequence $\{c_{k}\}_{k=1}^{n}$ of equidistant points $c_{k}=1/2+k/n$ . Then for any $\tau_{t}/2\leq\tau\leq 3\tau_{t}/2$ with $\tau_{t}=\sigma_{\tau_{t}}\sqrt{n/t}$ , there exists some $1\leq k\leq n$ such that $|\tau-\tau_{t,k}|\leq\sigma_{\tau_{t}}(nt)^{-1/2}$ , where $\tau_{t,k}:=c_{k}\sigma_{\tau_{t}}\sqrt{n/t}$ . It follows that

[TABLE]

For every $1\leq k\leq n$ , we have

[TABLE]

with probability at least $1-2e^{-t}$ . By (3.6), $|\mu_{\tau_{t,k}}|\leq(\sigma^{2}-\sigma_{\tau_{t,k}}^{2})/\tau_{t,k}$ . Apply the union bound over $1\leq k\leq n$ to see that

[TABLE]

with probability at least $1-2ne^{-t}$ . Together, (E.6) and (E.7) prove (3.10). ∎

E.4 Proof of Proposition 3.2

Using the notation in Section E.1, equation (3.8) can be written as $q(\tau)=t/n$ . By Lemma E.1, the function $q$ satisfies $\max_{z\geq 0}q(z)=\lim_{z\to 0}q(z)=\mathbb{P}(|X|>0)$ , $\lim_{z\to\infty}q(z)=0$ and is strictly decreasing on $(\Delta,\infty)$ . Provided $t/n<\mathbb{P}(|X|>0)$ , equation (3.8) has a unique solution that lies in $(\Delta,\infty)$ .
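The empirical counterpart of the equation $q(\tau)=t/n$ can be solved by bisection, since the empirical $q_{n}$ is non-increasing. The sketch below assumes $q_{n}(\tau)=n^{-1}\sum_{i=1}^{n}(X_{i}^{2}\wedge\tau^{2})/\tau^{2}$, in line with the notation of Section E.1. The function names, the bracketing interval and the tolerance are hypothetical choices for illustration.

```python
import numpy as np

def q_n(tau, x):
    """Empirical q_n(tau) = mean(min(x_i^2, tau^2)) / tau^2."""
    return float(np.mean(np.minimum(x**2, tau**2)) / tau**2)

def solve_tau(x, t, tol=1e-10, max_iter=200):
    """Bisection for q_n(tau) = t / n; requires t < #{i : x_i != 0}."""
    n = x.size
    nonzero = np.abs(x[x != 0])
    if not t < nonzero.size:
        raise ValueError("need t < number of nonzero observations")
    lo = float(nonzero.min())                  # q_n(lo) = #nonzero / n > t / n
    hi = float(np.sqrt(np.sum(x**2) / t))      # q_n(hi) <= mean(x^2)/hi^2 = t / n
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if q_n(mid, x) > t / n:
            lo = mid
        else:
            hi = mid
        if hi - lo <= tol * hi:
            break
    return 0.5 * (lo + hi)

rng = np.random.default_rng(0)
x = rng.standard_t(df=3, size=500)             # illustrative heavy-tailed sample
t = np.log(x.size)                             # illustrative confidence parameter
tau_hat = solve_tau(x, t)
```

At the solution, $\widehat{\tau}^{2}=n^{-1}\sum_{i}(X_{i}^{2}\wedge\widehat{\tau}^{2})\cdot(n/t)$, which mirrors the population relation used in the proof.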

By definition, this unique solution $\tau_{t}$ satisfies

[TABLE]

On the other hand, note that $\mathbb{E}(X^{2}\wedge\tau^{2})\geq\tau^{2}\mathbb{P}(|X|>\tau)$ for any $\tau>0$ . It follows that $\mathbb{P}(|X|>\tau_{t})\leq t/n$ , which implies $\tau_{t}\geq q_{t/n}$ . Substituting this into (E.8) gives $\tau_{t}^{2}\geq\mathbb{E}(X^{2}\wedge q_{t/n}^{2})(n/t)$ .

To prove Part (ii), recall that $q(\tau_{t})=t/n$ . Since $t/n\to 0$ and $q(z)$ strictly decreases to zero as $z\to\infty$ , we have $\tau_{t}\to\infty$ and therefore $\mathbb{E}(X^{2}\wedge\tau_{t}^{2})\to\sigma^{2}$ as $n\to\infty$ . ∎

E.5 Proof of Theorem 3.3

By Proposition 3.3, $\widehat{\tau}_{t}$ is uniquely determined and positive on the event $\{t<\sum_{i=1}^{n}I(|X_{i}|>0)\}$ . Under the condition $\mathbb{P}(X=0)=0$ and when $t<n$ , this event occurs with probability one. We divide the rest of the proof into four steps.

Step 1. Define functions

[TABLE]

Applying Lemma E.1 to $p_{n}$ and $q_{n}$ implies $q_{n}^{\prime}(z)=-2z^{-1}p_{n}(z)$ . Therefore,

[TABLE]

by change of variables $u=(z-\tau_{t})/\tau_{t}$ . By definition, $q_{n}(\widehat{\tau}_{t})=t/n=q(\tau_{t})$ . It then follows that

[TABLE]

For any $r\in(0,1)$ , it holds on the event $\{(\widehat{\tau}_{t}-\tau_{t})/\tau_{t}\geq r\}$ that

[TABLE]

Similarly, on the event $\{(\widehat{\tau}_{t}-\tau_{t})/\tau_{t}\leq-r\}$ , it holds

[TABLE]

Putting the above calculations together, we arrive at

[TABLE]

Set $\zeta_{i}=(X_{i}^{2}\wedge\tau_{t}^{2})/\tau_{t}^{2}$ such that $q_{n}(\tau_{t})-q(\tau_{t})=(1/n)\sum_{i=1}^{n}\{\zeta_{i}-\mathbb{E}(\zeta_{i})\}$ . Note that $0\leq\zeta_{i}\leq 1$ and $\mathbb{E}(\zeta_{i}^{2})\leq\mathbb{E}(X_{i}^{2}\wedge\tau_{t}^{2})/\tau_{t}^{2}=t/n$ . By Bernstein’s inequality, for any $u>0$ it holds

[TABLE]

On the other hand, applying Theorem 2.19 in de la Peña, Lai and Shao (2009) with $X_{i}=\zeta_{i}/n$ therein gives that, for any $0<u<t$ ,

[TABLE]

Step 2 (Controlling $R_{1}$ and $R_{2}$ ). Note that $R_{1}$ and $R_{2}$ can be written, respectively, as $R_{1}=(2/n)\sum_{i=1}^{n}\{\xi_{i}-\mathbb{E}(\xi_{i})\}$ and $R_{2}=-(2/n)\sum_{i=1}^{n}\{\eta_{i}-\mathbb{E}(\eta_{i})\}$ , where

[TABLE]

are bounded, non-negative random variables satisfying

[TABLE]

In addition,

[TABLE]

and

[TABLE]

Recall that $q(\tau_{t})=t/n$ . Again, by Theorem 2.19 in de la Peña, Lai and Shao (2009) we have, for any $v>0$ ,

[TABLE]

and

[TABLE]

Step 3 (Bounding $D_{1}$ and $D_{2}$ ). Starting with $D_{1}$ , by Lemma E.1 we have

[TABLE]

Similarly,

[TABLE]

Step 4. Together, (E.9) and (E.12)–(E.15) imply that, for any $0<r<1$ and $v>0$ ,

[TABLE]

Note that

[TABLE]

and

[TABLE]

Taking $v=(a_{1}\wedge a_{2})t/2$ for $a_{1}$ and $a_{2}$ as in (3.14), the right-hand side of (E.16) can further be bounded by

[TABLE]

Combining this with (E.10), (E.11) and (E.16) proves (3.13). ∎

Appendix F Additional Simulation Studies

F.1 Standard deviations of estimated quantiles

In this section, we report the standard deviations of the estimated quantiles for the results in Table 1 and Table 3; see Table 6 and Table 7 below, which correspond to Table 1 and Table 3, respectively. For each setting in Tables 6 and 7, the standard deviation of the estimated quantiles for the boot-Huber method is slightly smaller than that for the boot-OLS method. In this sense, both bootstrap-based methods are rather stable, although the latter suffers from distorted empirical coverage due to heavy-tailedness. For settings in other tables in the main text, the observations are similar and thus we omit the details.

F.2 Correlated design

In this section, we consider some more challenging cases in which the designs are highly correlated and/or non-Gaussian. Specifically, we consider the following two scenarios:

1. The covariate vector $\bm{X}=(X_{1},\ldots,X_{d})^{\intercal}\in\mathbb{R}^{d}$ follows a multivariate uniform distribution $\mathrm{Unif}([0,1]^{d})$ with $\mathrm{Corr}(X_{j},X_{k})=0.5^{|j-k|}$ for $1\leq j\neq k\leq d$ . See Falk (1999) for the construction of a multivariate uniform distribution. Each component of $\bm{\theta}^{*}$ follows a Bernoulli distribution with probability 0.5, i.e. $\text{Ber}(0.5)$ . The results for this case are presented in Tables 8 (with $n=100$ ) and 9 (with $n=200$ ).

2. The covariate vector $\bm{X}$ follows $\mathcal{N}(\textbf{0},\bm{\Sigma})$ , where the covariance matrix $\bm{\Sigma}=(\sigma_{jk})_{1\leq j,k\leq d}$ has a Toeplitz structure with $\sigma_{jk}=0.9^{|j-k|}$ . The components of $\bm{\theta}^{*}$ are equally spaced in $[0,1]$ . The results for this case are presented in Tables 10 (with $n=100$ ) and 11 (with $n=200$ ).
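The Gaussian design with Toeplitz covariance can be generated as in the sketch below. The values of $n$ and $d$, the Student-$t$ noise, the helper name `toeplitz_cov`, and the choice of including both endpoints when spacing $\bm{\theta}^{*}$ over $[0,1]$ are assumptions made for illustration.

```python
import numpy as np

def toeplitz_cov(d, rho):
    """Covariance matrix with entries rho^{|j-k|}."""
    idx = np.arange(d)
    return rho ** np.abs(idx[:, None] - idx[None, :])

rng = np.random.default_rng(0)
n, d = 100, 5                                # illustrative sizes

Sigma = toeplitz_cov(d, 0.9)
L = np.linalg.cholesky(Sigma)                # Sigma = L L^T
X = rng.standard_normal((n, d)) @ L.T        # rows are draws from N(0, Sigma)

theta_star = np.linspace(0.0, 1.0, d)        # equally spaced, endpoints included (assumed)
noise = rng.standard_t(df=3, size=n)         # placeholder heavy-tailed noise (assumed)
Y = X @ theta_star + noise
```

Multiplying standard normal rows by the transposed Cholesky factor is one common way to impose a target covariance; any matrix square root of $\bm{\Sigma}$ would serve equally well.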

From Tables 8–11 we find that the average coverage probabilities of the boot-Huber method are in general close to nominal levels, while the boot-OLS leads to severe under-coverage for many heavy-tailed noise settings.

F.3 Simulations on multiple testing

In this section, we evaluate the empirical performance of the proposed robust multiple testing procedure described in Algorithm 2. Recall the multi-response regression model (1.2):

[TABLE]

where $\bm{\beta}_{k}\in\mathbb{R}^{s}$ . We choose $\mu_{k}=\gamma\sigma\sqrt{(2\log m)/n}$ for $1\leq k\leq m_{1}$ with $m_{1}=0.05m$ and $\mu_{k}=0$ for $m_{1}+1\leq k\leq m$ , where $\sigma^{2}=\textnormal{var}(\varepsilon_{i})=1$ . The parameter $\gamma$ takes the value either 1.5 (i.e. the weaker signal strength case) or 3 (i.e. the stronger signal strength case). We generate $\{\bm{x}_{i}\}_{i=1}^{n}$ from $\mathcal{N}(\textbf{0},\bm{I}_{s})$ and $\bm{\beta}_{k}$ from the uniform distribution on $[-1,1]^{s}$ for $k=1,\ldots,m$ . The settings of error distributions are the same as in Section 5.1. The number of tests $m$ is set to be $1000$ . The bootstrap weights $\{w_{ik},1\leq i\leq n,1\leq k\leq m\}$ are IID from $\mathcal{N}(1,1)$ . For each setup, we report the average false discovery proportion and empirical power based on 1000 simulations. The FDP nominal level takes value in $\{5\%,10\%,15\%,20\%,25\%\}$ .
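The data-generating setup described above can be sketched as follows. The model form $y_{ik}=\mu_{k}+\bm{x}_{i}^{\intercal}\bm{\beta}_{k}+\varepsilon_{ik}$ is an assumption about (1.2) inferred from the notation used in the proof of Theorem 4.1, and the rescaled Student-$t$ errors stand in for the error settings of Section 5.1, which are not reproduced here. The values $n=100$, $s=5$ and $\gamma=1.5$ correspond to one of the reported configurations.

```python
import numpy as np

rng = np.random.default_rng(0)
n, s, m = 100, 5, 1000
gamma, sigma = 1.5, 1.0                      # weaker-signal case, unit error variance
m1 = round(0.05 * m)                         # number of non-null hypotheses

# Intercepts: signals on the first m1 coordinates, zeros elsewhere.
mu = np.zeros(m)
mu[:m1] = gamma * sigma * np.sqrt(2.0 * np.log(m) / n)

X = rng.standard_normal((n, s))              # x_i ~ N(0, I_s)
beta = rng.uniform(-1.0, 1.0, size=(m, s))   # beta_k ~ Unif([-1, 1]^s)

# Placeholder heavy-tailed errors with unit variance: t_3 has variance 3.
eps = rng.standard_t(df=3, size=(n, m)) / np.sqrt(3.0)

Y = mu[None, :] + X @ beta.T + eps           # n x m response matrix (assumed model form)
W = rng.normal(loc=1.0, scale=1.0, size=(n, m))  # bootstrap weights w_ik ~ N(1, 1)
```

Each column of `Y` is one regression problem sharing the same covariates, and each column of `W` supplies the multiplier weights for the corresponding test.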

Tables 12 and 13 show the empirical FDPs and powers for the weaker signal case with $\mu_{k}=1.5\sqrt{2(\log m)/n}$ , while Tables 14 and 15 show the results for the stronger signal case with $\mu_{k}=3\sqrt{2(\log m)/n}$ . Moreover, Tables 12 and 14 consider different error distributions when $n=100$ and $s=5$ . When the error is from a $t$ -Weibull mixture distribution, Tables 13 and 15 present the results for different combinations of $(s,n)$ , revealing the influence of $s$ on the difficulty of the problem. In particular, the combination of $s=10$ , $n=100$ and ${\rm signal~{}strength}=1.5\sqrt{2(\log m)/n}$ corresponds to the most challenging scenario. Increasing either the sample size or the signal strength improves both the FDP control and power performance, which is consistent with our theoretical result in Theorem 4.1. In summary, with various types of heavy-tailed errors and across different settings, the proposed robust testing procedure performs well and steadily in terms of FDP control and power.

References

Adamczak et al. (2011)

Adamczak, R., Litvak, A. E., Pajor, A. and Tomczak-Jaegermann, N. (2011). Restricted isometry property of matrices with independent columns and neighborly polytopes by random sampling. Constr. Approx. 34 61–88.

Ball (1993)

Ball, K. (1993). The reverse isoperimetric problem for Gaussian measure. Discrete Comput. Geom. 10 411–420.

Bentkus (2005)

Bentkus, V. (2005). A Lyapunov-type bound in $R^{d}$ . Theory Probab. Appl. 49 311–323.

Bousquet (2003)

Bousquet, O. (2003). Concentration inequalities for sub-additive functions using the entropy method. In Stochastic Inequalities and Applications. Progress in Probability 56 213–247. Birkhäuser, Basel.

Bubeck, Cesa-Bianchi and Lugosi (2013)

Bubeck, S., Cesa-Bianchi, N. and Lugosi, G. (2013). Bandits with heavy tail. IEEE Trans. Inform. Theory 59 7711–7717.

Catoni (2012)

Catoni, O. (2012). Challenging the empirical mean and empirical variance: A deviation study. Ann. Inst. Henri Poincaré Probab. Stat. 48 1148–1185.

de la Peña, Lai and Shao (2009)

de la Peña, V. H., Lai, T. L. and Shao, Q.-M. (2009). Self-Normalized Processes: Limit Theory and Statistical Applications. Springer, Berlin.

De Luca, Facchinei and Kanzow (1996)

De Luca, T., Facchinei, F. and Kanzow, C. (1996). A semismooth equation approach to the solution of nonlinear complementarity problems. Math. Program. 75 407–439.

Devroye et al. (2016)

Devroye, L., Lerasle, M., Lugosi, G. and Oliveira, R. I. (2016). Sub-Gaussian mean estimators. Ann. Statist. 44 2695–2725.

Falk (1999)

Falk, M. (1999). A simple approach to the generation of uniformly distributed random variables with prescribed correlations. Comm. Statist. Simulation Comput. 28 785–791.

Fan, Li and Wang (2017)

Fan, J., Li, Q. and Wang, Y. (2017). Estimation of high dimensional mean regression in the absence of symmetry and light tail assumptions. J. R. Stat. Soc. Ser. B. Stat. Methodol. 79 247–265.

Fan et al. (2018)

Fan, J., Liu, H., Sun, Q. and Zhang, T. (2018). I-LAMM for sparse learning: Simultaneous control of algorithmic complexity and statistical error. Ann. Statist. 46 814–841.

Hahn, Kuelbs and Weiner (1990)

Hahn, M. G., Kuelbs, J. and Weiner, D. C. (1990). The asymptotic joint distribution of self-normalized censored sums and sums of squares. Ann. Probab. 18 1284–1341.

Hiriart-Urruty and Lemaréchal (2001)

Hiriart-Urruty, J. B. and Lemaréchal, C. (2001). Fundamentals of Convex Analysis. Springer-Verlag.

Ledoux and Talagrand (1991)

Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces: Isoperimetry and Processes. Springer-Verlag, Berlin.

Liu and Shao (2014)

Liu, W. and Shao, Q.-M. (2014). Phase transition and regularized bootstrap in large-scale $t$-tests with false discovery rate control. Ann. Statist. 42 2003–2025.

Lepskiĭ (1991)

Lepskiĭ, O. V. (1991). Asymptotically minimax adaptive estimation. I. Upper bounds. Optimally adaptive estimates. Teor. Veroyatn. Primen. 36 645–659.

Loh and Wainwright (2015)

Loh, P.-L. and Wainwright, M. J. (2015). Regularized $M$-estimators with nonconvexity: Statistical and algorithmic theory for local optima. J. Mach. Learn. Res. 16 559–616.

Minsker (2018)

Minsker, S. (2018). Sub-Gaussian estimators of the mean of a random matrix with heavy-tailed entries. Ann. Statist. 46 2871–2903.

Pugh (2015)

Pugh, C. C. (2015). Real Mathematical Analysis, 2nd ed. Springer-Verlag, New York.

Qi and Sun (1999)

Qi, L. and Sun, D. (1999). A survey of some nonsmooth equations and smoothing Newton methods. In Progress in Optimization 121–146. Springer, Boston, MA.

Spokoiny (2012)

Spokoiny, V. (2012). Parametric estimation. Finite sample theory. Ann. Statist. 40 2877–2909.

Spokoiny (2013)

Spokoiny, V. (2013). Bernstein–von Mises theorem for growing parameter dimension. Preprint. Available at arXiv:1302.3430.

Spokoiny and Zhilova (2015)

Spokoiny, V. and Zhilova, M. (2015). Bootstrap confidence sets under model misspecification. Ann. Statist. 43 2653–2675.

Storey, Taylor and Siegmund (2004)

Storey, J. D., Taylor, J. E. and Siegmund, D. (2004). Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rate: A unified approach. J. R. Stat. Soc. Ser. B. Stat. Methodol. 66 187–205.

Vershynin (2018)

Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Univ. Press, Cambridge.

Zhou et al. (2018)

Zhou, W.-X., Bose, K., Fan, J. and Liu, H. (2018). A new perspective on robust $M$-estimation: Finite sample theory and applications to dependence-adjusted multiple testing. Ann. Statist. 46 1904–1931.
