A proximal dual semismooth Newton method for computing zero-norm penalized QR estimator

Dongdong Zhang; Shaohua Pan; Shujun Bi

arXiv:1907.03435·math.OC·November 24, 2020


TL;DR

This paper introduces a novel multi-stage convex relaxation method using a proximal dual semismooth Newton approach to efficiently compute high-dimensional zero-norm penalized quantile regression estimators, with theoretical guarantees and superior empirical performance.

Contribution

It develops a new multi-stage convex relaxation algorithm with a proximal dual semismooth Newton method for zero-norm penalized QR, providing theoretical error bounds and convergence analysis.

Findings

01. Achieves linear convergence rate under restricted strong convexity.

02. Outperforms existing methods in estimation accuracy and computational efficiency.

03. Demonstrates effectiveness on synthetic and real datasets.


Tables

Table 1: Identification performance of MSCRA_PPA

		$n = 250$	$n = 300$	$n = 400$	$n = 500$
$τ = 0.3$	Size	11.800(4.369)	9.320(3.146)	6.290(1.472)	5.330(0.697)
	$P_{1}$	0.81	0.83	0.93	0.91
	$P_{2}$	0.81	0.83	0.93	0.91
	AE	0.197(0.174)	0.170(0.165)	0.176(0.155)	0.145(0.127)
$τ = 0.5$	Size	10.960(3.075)	7.910(2.060)	5.270(1.171)	4.370(0.597)
	$P_{1}$	1.00	1.00	1.00	1.00
	$P_{2}$	0.00	0.00	0.00	0.00
	AE	0.034(0.014)	0.027(0.011)	0.021(0.010)	0.018(0.008)
$τ = 0.7$	Size	12.590(4.356)	8.320(2.169)	6.310(1.308)	5.380(0.693)
	$P_{1}$	0.79	0.88	0.91	0.93
	$P_{2}$	0.79	0.88	0.91	0.93
	AE	0.183(0.175)	0.220(0.180)	0.151(0.146)	0.162(0.142)

Table 2: Estimation and selection performance of three solvers for $\Sigma_{x}=I$

$ε$	Method	$γ_{opt}$	$L_{2}$ -error	FP	FN	Time(s)	$γ_{opt}$	$L_{2}$ -error	FP	FN	Time(s)
		$τ = 0.5$					$τ = 0.75$
$𝒩 (0, 2)$	IPM	0.104	0.444(0.107)	5.100(2.057)	0.730(0.468)	4.221	0.110	0.523(0.157)	7.840(3.034)	0.670(0.514)	5.613
	ADMM	0.104	0.446(0.106)	5.100(2.028)	0.730(0.468)	3.033	0.110	0.523(0.158)	7.760(3.079)	0.670(0.514)	3.847
	PPA	0.116	0.446(0.119)	1.920(1.228)	0.800(0.426)	0.138	0.119	0.557(0.188)	3.810(1.937)	0.840(0.420)	0.202
${MN}_{1}$	IPM	0.104	0.345(0.066)	5.030(2.007)	0.410(0.494)	3.566	0.110	0.377(0.078)	6.860(2.741)	0.490(0.502)	4.168
	ADMM	0.104	0.345(0.067)	5.150(2.110)	0.410(0.494)	2.601	0.110	0.377(0.078)	6.890(2.723)	0.480(0.502)	3.062
	PPA	0.110	0.347(0.066)	3.260(1.779)	0.510(0.502)	0.131	0.116	0.375(0.061)	5.050(2.333)	0.590(0.494)	0.191
${MN}_{2}$	IPM	0.104	1.425(0.361)	6.750(2.955)	1.860(0.921)	5.558	0.122	1.764(0.501)	4.220(2.377)	2.660(1.085)	5.568
	ADMM	0.104	1.427(0.356)	6.760(3.114)	1.880(0.902)	3.829	0.122	1.749(0.512)	4.270(2.432)	2.670(1.064)	3.825
	PPA	0.116	1.347(0.343)	2.480(1.823)	2.320(0.994)	0.133	0.134	1.742(0.537)	1.790(1.690)	3.260(1.050)	0.151
Laplace	IPM	0.098	0.324(0.071)	7.410(2.775)	0.220(0.416)	3.835	0.110	0.364(0.089)	6.550(2.484)	0.410(0.494)	3.789
	ADMM	0.098	0.324(0.070)	7.450(2.797)	0.220(0.416)	2.709	0.110	0.365(0.089)	6.580(2.458)	0.400(0.492)	2.761
	PPA	0.104	0.326(0.073)	4.700(2.209)	0.280(0.451)	0.144	0.116	0.382(0.094)	4.970(2.158)	0.480(0.502)	0.204
$\sqrt{2} \times t_{4}$	IPM	0.104	0.487(0.139)	5.330(2.301)	0.760(0.474)	4.677	0.110	0.649(0.238)	7.300(2.880)	0.840(0.507)	4.907
	ADMM	0.104	0.487(0.138)	5.360(2.325)	0.760(0.474)	3.214	0.110	0.647(0.239)	7.360(2.812)	0.840(0.507)	3.340
	PPA	0.110	0.502(0.180)	3.160(1.587)	0.790(0.478)	0.157	0.122	0.684(0.286)	2.970(1.861)	1.010(0.643)	0.239
Cauchy	IPM	0.098	0.536(0.217)	8.340(3.019)	0.670(0.533)	4.954	0.110	0.730(0.364)	6.740(2.493)	1.000(0.765)	5.488
	ADMM	0.098	0.531(0.216)	8.340(2.879)	0.680(0.530)	2.989	0.110	0.729(0.360)	6.720(2.551)	1.010(0.759)	3.404
	PPA	0.116	0.560(0.274)	1.780(1.203)	0.910(0.637)	0.166	0.125	0.816(0.381)	2.760(1.837)	1.280(0.792)	0.243

Table 3: Estimation and selection performance of three solvers for ${\rm AR}_{0.5}$

$ε$	Method	$γ_{opt}$	$L_{2}$ -error	FP	FN	Time(s)	$γ_{opt}$	$L_{2}$ -error	FP	FN	Time(s)
		$τ = 0.5$					$τ = 0.75$
$𝒩 (0, 2)$	IPM	0.104	0.467(0.119)	4.650(2.148)	0.710(0.456)	3.744	0.110	0.609(0.222)	6.830(2.843)	0.800(0.512)	4.312
	ADMM	0.104	0.474(0.120)	4.620(2.112)	0.730(0.446)	2.553	0.110	0.606(0.214)	6.860(2.853)	0.800(0.512)	3.143
	PPA	0.110	0.491(0.145)	2.810(1.594)	0.760(0.474)	0.133	0.122	0.591(0.199)	3.020(1.664)	0.870(0.442)	0.201
${MN}_{1}$	IPM	0.098	0.365(0.074)	7.020(2.515)	0.410(0.494)	3.661	0.110	0.399(0.076)	6.450(2.679)	0.570(0.498)	3.729
	ADMM	0.098	0.367(0.073)	7.070(2.536)	0.400(0.492)	2.746	0.110	0.399(0.076)	6.500(2.676)	0.570(0.498)	2.819
	PPA	0.098	0.366(0.073)	7.060(2.566)	0.410(0.494)	0.139	0.122	0.423(0.127)	3.390(1.959)	0.630(0.485)	0.180
${MN}_{2}$	IPM	0.104	1.383(0.394)	4.990(2.472)	2.060(0.930)	5.168	0.122	1.665(0.434)	3.640(2.013)	2.610(0.920)	5.339
	ADMM	0.104	1.379(0.384)	5.220(2.747)	2.010(0.937)	3.446	0.122	1.679(0.420)	3.670(2.080)	2.590(0.911)	3.764
	PPA	0.119	1.365(0.420)	1.590(1.436)	2.490(0.937)	0.101	0.131	1.705(0.512)	2.100(1.755)	3.010(0.959)	0.167
Laplace	IPM	0.098	0.349(0.089)	7.250(2.564)	0.360(0.482)	3.818	0.110	0.381(0.099)	6.320(2.624)	0.580(0.496)	4.513
	ADMM	0.098	0.349(0.089)	7.250(2.591)	0.360(0.482)	2.851	0.110	0.381(0.099)	6.380(2.666)	0.570(0.498)	3.130
	PPA	0.104	0.352(0.088)	4.600(2.079)	0.410(0.494)	0.125	0.116	0.408(0.154)	4.610(2.188)	0.480(0.522)	0.209
$\sqrt{2} \times t_{4}$	IPM	0.104	0.534(0.165)	4.580(2.142)	0.830(0.473)	4.341	0.110	0.734(0.291)	6.920(2.990)	1.070(0.573)	5.785
	ADMM	0.104	0.533(0.165)	4.590(2.109)	0.830(0.473)	3.179	0.110	0.736(0.288)	6.860(3.052)	1.070(0.573)	3.891
	PPA	0.110	0.542(0.180)	3.020(1.723)	0.860(0.472)	0.129	0.122	0.710(0.283)	3.240(1.782)	1.150(0.575)	0.209
Cauchy	IPM	0.101	0.544(0.245)	6.130(2.232)	0.820(0.539)	4.912	0.104	0.695(0.343)	9.450(3.105)	0.980(0.681)	5.948
	ADMM	0.104	0.538(0.258)	4.890(2.136)	0.860(0.513)	2.952	0.104	0.693(0.335)	9.530(2.883)	0.950(0.672)	3.686
	PPA	0.116	0.561(0.280)	1.740(1.292)	0.980(0.603)	0.169	0.122	0.879(0.473)	3.270(1.814)	1.430(0.956)	0.233

Table 4: Estimation and selection performance of three solvers for ${\rm AR}_{0.8}$

$ε$	Method	$γ_{opt}$	$L_{2}$ -error	FP	FN	Time(s)	$γ_{opt}$	$L_{2}$ -error	FP	FN	Time(s)
		$τ = 0.5$					$τ = 0.75$
$𝒩 (0, 2)$	IPM	0.095	0.852(0.361)	7.050(2.504)	1.260(0.733)	4.117	0.098	0.986(0.408)	10.740(3.852)	1.400(0.804)	6.170
	ADMM	0.092	0.835(0.336)	8.800(2.723)	1.240(0.698)	3.306	0.098	0.996(0.404)	10.940(3.961)	1.400(0.816)	4.721
	PPA	0.110	0.910(0.404)	2.390(1.550)	1.520(0.731)	0.111	0.110	0.965(0.387)	5.140(2.454)	1.440(0.701)	0.193
${MN}_{1}$	IPM	0.098	0.530(0.208)	5.300(2.368)	0.780(0.504)	3.683	0.098	0.622(0.254)	9.510(4.036)	0.850(0.557)	5.205
	ADMM	0.092	0.519(0.184)	8.460(2.844)	0.770(0.489)	2.933	0.098	0.625(0.261)	9.630(4.099)	0.850(0.557)	3.851
	PPA	0.104	0.550(0.227)	3.550(1.977)	0.800(0.512)	0.132	0.110	0.644(0.321)	5.120(2.363)	1.000(0.682)	0.184
${MN}_{2}$	IPM	0.104	1.742(0.616)	4.350(2.086)	2.590(0.889)	4.362	0.122	2.113(0.641)	3.120(1.981)	3.020(0.995)	5.187
	ADMM	0.104	1.713(0.642)	4.560(2.203)	2.500(0.959)	3.187	0.116	2.139(0.629)	4.230(2.155)	2.970(0.958)	4.269
	PPA	0.140	1.809(0.649)	0.820(0.936)	2.920(0.929)	0.085	0.152	2.125(0.721)	0.940(0.886)	3.290(0.868)	0.126
Laplace	IPM	0.098	0.520(0.257)	5.810(2.639)	0.720(0.637)	3.767	0.104	0.650(0.375)	6.980(3.291)	0.980(0.710)	3.990
	ADMM	0.098	0.510(0.242)	5.880(2.626)	0.710(0.608)	2.864	0.104	0.645(0.370)	7.140(3.333)	0.970(0.703)	3.180
	PPA	0.104	0.543(0.267)	3.780(2.177)	0.840(0.615)	0.124	0.116	0.679(0.386)	3.710(2.176)	1.150(0.716)	0.167
$\sqrt{2} \times t_{4}$	IPM	0.095	0.955(0.412)	7.180(2.754)	1.470(0.658)	4.517	0.098	1.135(0.465)	10.250(4.029)	1.660(0.831)	5.201
	ADMM	0.092	0.934(0.407)	8.700(3.125)	1.410(0.653)	3.236	0.098	1.135(0.485)	10.400(3.929)	1.660(0.867)	3.641
	PPA	0.110	1.009(0.400)	2.570(1.736)	1.630(0.646)	0.118	0.110	1.190(0.542)	5.450(2.516)	1.870(0.939)	0.194
Cauchy	IPM	0.104	0.891(0.452)	3.440(2.134)	1.420(0.684)	3.853	0.110	1.168(0.573)	4.970(2.676)	1.790(0.946)	4.842
	ADMM	0.098	0.850(0.435)	5.590(2.586)	1.320(0.723)	2.672	0.110	1.153(0.549)	4.950(2.668)	1.770(0.908)	2.901
	PPA	0.116	0.962(0.452)	1.380(1.237)	1.570(0.700)	0.157	0.122	1.138(0.570)	2.920(1.895)	1.800(0.921)	0.205

Table 5: Estimation and selection performance of three solvers for ${\rm CS}_{0.5}$

$ε$	Method	$γ_{opt}$	$L_{2}$ -error	FP	FN	Time(s)	$γ_{opt}$	$L_{2}$ -error	FP	FN	Time(s)
		$τ = 0.5$					$τ = 0.75$
$𝒩 (0, 2)$	IPM	0.092	0.683(0.266)	1.710(1.597)	1.130(0.464)	3.819	0.092	0.943(0.366)	3.810(2.759)	1.340(0.685)	4.533
	ADMM	0.092	0.700(0.272)	1.750(1.459)	1.140(0.472)	3.336	0.098	0.962(0.388)	2.780(2.245)	1.450(0.757)	3.761
	PPA	0.104	0.744(0.282)	0.650(0.880)	1.260(0.543)	0.195	0.116	0.934(0.347)	1.020(1.163)	1.580(0.684)	0.227
${MN}_{1}$	IPM	0.092	0.437(0.093)	1.300(1.243)	0.810(0.394)	3.366	0.098	0.505(0.157)	2.070(1.816)	0.840(0.368)	3.687
	ADMM	0.098	0.441(0.097)	0.730(0.777)	0.820(0.386)	2.981	0.098	0.506(0.148)	2.030(1.702)	0.840(0.368)	3.475
	PPA	0.104	0.448(0.107)	0.350(0.557)	0.930(0.293)	0.178	0.116	0.523(0.192)	0.420(0.867)	1.020(0.200)	0.235
${MN}_{2}$	IPM	0.110	1.919(0.526)	2.320(1.999)	3.090(0.877)	3.447	0.122	2.253(0.492)	2.690(1.813)	3.550(0.744)	3.224
	ADMM	0.122	1.977(0.490)	3.210(2.271)	3.100(0.882)	3.088	0.143	2.268(0.451)	3.800(2.094)	3.530(0.745)	3.241
	PPA	0.152	2.016(0.545)	1.650(1.480)	3.410(0.866)	0.117	0.155	2.444(0.579)	2.600(1.717)	3.830(0.842)	0.170
Laplace	IPM	0.086	0.445(0.140)	2.390(2.117)	0.810(0.394)	3.926	0.098	0.568(0.253)	2.290(2.027)	1.010(0.414)	3.868
	ADMM	0.086	0.445(0.139)	2.520(2.134)	0.800(0.402)	3.773	0.092	0.559(0.212)	3.480(2.552)	0.920(0.442)	3.889
	PPA	0.098	0.469(0.167)	0.930(1.380)	0.910(0.379)	0.181	0.104	0.586(0.279)	1.570(2.171)	1.110(0.510)	0.250
$\sqrt{2} \times t_{4}$	IPM	0.092	0.874(0.352)	1.960(1.780)	1.400(0.651)	4.345	0.092	1.206(0.486)	4.150(2.724)	1.710(0.868)	4.657
	ADMM	0.086	0.905(0.339)	3.600(2.229)	1.310(0.598)	4.071	0.095	1.259(0.448)	3.760(2.527)	1.800(0.791)	3.875
	PPA	0.110	0.966(0.347)	0.910(1.215)	1.610(0.680)	0.165	0.116	1.172(0.429)	1.290(1.241)	1.980(0.816)	0.216
Cauchy	IPM	0.086	0.803(0.377)	3.050(2.208)	1.330(0.620)	5.123	0.092	1.239(0.575)	3.910(2.016)	1.900(0.859)	5.142
	ADMM	0.092	0.896(0.436)	2.270(1.869)	1.480(0.674)	3.599	0.095	1.392(0.592)	4.190(2.608)	2.040(0.887)	3.471
	PPA	0.101	0.880(0.415)	1.200(1.198)	1.460(0.658)	0.278	0.113	1.237(0.502)	1.470(1.540)	2.030(0.834)	0.333

Table 6: Estimation and selection performance of three solvers for ${\rm CS}_{0.8}$

$ε$	Method	$γ_{opt}$	$L_{2}$ -error	FP	FN	Time(s)	$γ_{opt}$	$L_{2}$ -error	FP	FN	Time(s)
		$τ = 0.5$					$τ = 0.75$
$𝒩 (0, 2)$	IPM	0.092	1.572(0.411)	1.020(1.263)	2.630(0.761)	2.879	0.098	1.803(0.469)	1.480(1.337)	2.890(0.840)	2.907
	ADMM	0.131	1.683(0.365)	2.050(1.617)	2.820(0.796)	2.979	0.116	1.923(0.462)	3.050(2.057)	2.950(0.903)	3.077
	PPA	0.140	1.709(0.423)	0.650(1.029)	3.010(0.759)	0.229	0.140	1.939(0.460)	1.210(1.233)	3.220(0.773)	0.177
${MN}_{1}$	IPM	0.086	0.971(0.339)	0.330(0.604)	1.750(0.657)	3.269	0.086	1.118(0.405)	0.700(0.835)	1.840(0.762)	3.355
	ADMM	0.086	0.952(0.363)	0.910(1.173)	1.600(0.696)	3.178	0.098	1.249(0.365)	1.620(1.523)	1.980(0.738)	3.230
	PPA	0.110	1.128(0.336)	0.110(0.314)	2.070(0.655)	0.202	0.110	1.283(0.392)	0.460(0.784)	2.270(0.777)	0.150
${MN}_{2}$	IPM	0.134	3.087(0.643)	3.890(2.331)	4.510(0.893)	2.683	0.125	3.371(0.602)	4.780(2.729)	4.910(0.911)	2.739
	ADMM	0.137	2.897(0.496)	7.840(3.589)	4.250(0.903)	3.432	0.134	3.197(0.477)	8.640(3.586)	4.600(0.964)	3.491
	PPA	0.158	3.161(0.681)	3.910(2.708)	4.680(0.898)	0.146	0.149	3.507(0.625)	4.710(2.467)	5.120(0.868)	0.117
Laplace	IPM	0.086	1.066(0.409)	0.380(0.708)	1.910(0.753)	3.352	0.086	1.372(0.493)	1.130(1.284)	2.350(0.903)	3.417
	ADMM	0.098	1.177(0.441)	1.350(1.591)	2.010(0.745)	3.248	0.104	1.540(0.494)	2.510(2.267)	2.510(0.904)	3.223
	PPA	0.110	1.254(0.427)	0.220(0.561)	2.350(0.783)	0.192	0.128	1.558(0.496)	0.710(0.977)	2.800(0.829)	0.157
$\sqrt{2} \times t_{4}$	IPM	0.101	1.795(0.435)	1.300(1.314)	2.940(0.789)	2.923	0.104	2.160(0.517)	2.280(1.735)	3.230(0.827)	2.980
	ADMM	0.128	1.889(0.409)	3.320(2.344)	2.920(0.813)	3.215	0.110	2.210(0.462)	5.180(3.439)	3.250(0.833)	3.345
	PPA	0.146	1.923(0.454)	1.150(1.507)	3.200(0.816)	0.166	0.152	2.261(0.547)	1.580(1.505)	3.570(0.807)	0.137
Cauchy	IPM	0.095	1.986(0.618)	1.560(1.486)	3.230(0.874)	3.267	0.113	2.498(0.734)	2.390(1.933)	3.850(1.019)	3.122
	ADMM	0.128	2.181(0.564)	4.210(2.552)	3.440(0.903)	2.870	0.116	2.417(0.587)	5.240(3.108)	3.630(1.012)	2.881
	PPA	0.158	2.357(0.700)	1.460(1.374)	3.800(0.888)	0.212	0.134	2.667(0.805)	2.650(2.167)	4.160(1.080)	0.178

Table 7: Analysis of the microarray data by MSCRA_PPA and MSCRA_ADMM

		All data		Random partition
Method	$τ$	# genes	Time(s)	Ave. # genes	Pre_error	Time(s)
ADMM	0.25	17	3.843	17.200(1.807)	0.050(0.009)	4.686(0.804)
	0.5	27	4.141	20.960(4.323)	0.029(0.005)	3.555(0.496)
	0.75	19	4.314	21.280(2.611)	0.040(0.005)	3.534(0.405)
PPA	0.25	20	0.208	16.440(3.721)	0.023(0.006)	0.235(0.056)
	0.5	27	0.226	20.740(4.237)	0.029(0.005)	0.247(0.136)
	0.75	17	0.181	12.500(3.032)	0.024(0.004)	0.352(0.068)

Equations

y=X\beta^{*}+\varepsilon

f_{\tau}(z):=n^{-1}\sum_{i=1}^{n}\theta_{\tau}(z_{i})\ \ {\rm with}\ \ \theta_{\tau}(u):=\big(\tau-\mathbb{I}_{\{u\leq 0\}}\big)u

\widehat{\beta}(\tau)\in\mathop{\arg\min}_{\beta\in\mathbb{R}^{p}}\Big\{\nu f_{\tau}(y-\!X\beta)+\|\beta\|_{0}\Big\}

{\rm int}({\rm dom}\,\phi)\supseteq[0,1],\ \ t^{*}:=\mathop{\arg\min}_{0\leq t\leq 1}\phi(t),\ \ \phi(t^{*})=0\ \ {\rm and}\ \ \phi(1)=1.

\min_{w\in\mathbb{R}^{p}}\Big\{{\textstyle\sum_{i=1}^{p}}\phi(w_{i})\quad{\rm s.t.}\ \ \langle e-w,|z|\rangle=0,\,0\leq w\leq e\Big\}.

\min_{\beta\in\mathbb{R}^{p},w\in\mathbb{R}^{p}}\bigg\{\nu f_{\tau}(y-\!X\beta)+\sum_{i=1}^{p}\phi(w_{i})\quad{\rm s.t.}\ \ \langle e-w,|\beta|\rangle=0,\,0\leq w\leq e\bigg\}

\min_{\beta\in\mathbb{R}^{p},w\in[0,e]}\Big\{\nu f_{\tau}(y-\!X\beta)+\big[{\textstyle\sum_{i=1}^{p}}\phi(w_{i})+\rho\langle e-w,|\beta|\rangle\big]\Big\}

\min_{\beta\in\mathbb{R}^{p}}\Big\{\Theta_{\nu,\rho}(\beta):=f_{\tau}(y-\!X\beta)+\nu^{-1}{\textstyle\sum_{i=1}^{p}}\big[\rho|\beta_{i}|-\psi^{*}(\rho|\beta_{i}|)\big]\Big\}.

\psi^{*}(s)=\left\{\begin{array}{cl}0&{\rm if}\ s\leq 1,\\ s-1&{\rm if}\ s>1\end{array}\right.\quad{\rm and}\quad h_{\rho}(t)=\left\{\begin{array}{cl}\rho|t|&{\rm if}\ |t|\leq\frac{1}{\rho},\\ 1&{\rm if}\ |t|>\frac{1}{\rho}.\end{array}\right.

\psi^{*}(s)

h_{\rho}(t)

\beta^{k}\approx\mathop{\arg\min}_{\beta\in\mathbb{R}^{p}}\Big\{f_{\tau}(y-\!X\beta)+\lambda\,{\textstyle\sum_{i=1}^{p}}(1\!-\!w_{i}^{k-1})|\beta_{i}|\Big\}.

w_{i}^{k}=\mathop{\arg\min}_{0\leq w_{i}\leq 1}\big\{\phi(w_{i})-\rho_{k}w_{i}|\beta_{i}^{k}|\big\}.

w_{i}^{k}=\min\Big[1,\max\Big(0,\frac{(a+1)\rho_{k}|\beta_{i}^{k}|-2}{2(a-1)}\Big)\Big]\ \ {\rm for}\ i=1,\ldots,p.

\sum_{i=1}^{p}\psi^{*}(\rho|\beta_{i}|)\geq\sum_{i=1}^{p}\psi^{*}(\rho|\beta_{i}^{\prime}|)+\rho\langle w,|\beta|-|\beta^{\prime}|\rangle\quad\forall\beta\in\mathbb{R}^{p}.

f_{\tau}(y-\!X\beta)+\lambda\big\|(e-\!w^{k-1})\circ\beta\big\|_{1}-\lambda\big[\sum_{i=1}^{p}\psi^{*}(\rho|\beta_{i}^{k-1}|)+\rho\langle w^{k-1},|\beta^{k-1}|\rangle\big]

\delta^{k}

=-X^{\mathbb{T}}\partial\!f_{\tau}(y\!-\!X\beta^{k})+\lambda\big[(1\!-\!w_{1}^{k-1})\partial|\beta_{1}^{k}|\times\cdots\times(1\!-\!w_{p}^{k-1})\partial|\beta_{p}^{k}|\big]

u\in\partial f_{\tau}(z);\ \ \rho|\beta_{i}|\in\partial\psi(w_{i})\ \ {\rm for}\ i=1,\ldots,p;\ \ y-X\beta-z=0;

X^{\mathbb{T}}u\in\lambda\big[(1\!-\!w_{1})\partial|\beta_{1}|\times\cdots\times(1\!-\!w_{p})\partial|\beta_{p}|\big],

{\rm Err}_{k}:=\frac{\|\Delta_{1}\|^{2}+\|\Delta_{2}^{k}\|^{2}+\|y-X\beta^{k}-z^{k}\|^{2}}{1+\|y\|}\leq{\rm tol}

h_{k}(\beta):=\|\lambda(e-w^{k})\circ\beta\|_{1}\ \ {\rm for}\ \beta\in\mathbb{R}^{p}.

\mathcal{C}(S^{*}):=\bigcup_{S^{*}\subset S,|S|\leq 1.5s^{*}}\!\Big\{\beta\in\mathbb{R}^{p}\!:\|\beta_{S^{c}}\|_{1}\leq 3\|\beta_{S}\|_{1}\Big\}.

\kappa>0\ \ {\rm and}\ \ \frac{1}{2n}\|X\Delta\beta\|^{2}\geq\kappa\|\Delta\beta\|^{2}\ \ {\rm for\ all}\ \Delta\beta\in\mathcal{C}(S^{*}).

c\geq\frac{1}{\tau^{2}\kappa-27\tau\|X\|_{\max}(2n^{-1}\tau\|X\|_{1}+\epsilon)s^{*}},

\|\beta^{k}-\beta^{*}\|\leq\frac{9c\tau\lambda\,1.5s^{*}}{8}\|\varepsilon\|_{\infty}.

\beta^{\rm LS}\in\mathop{\arg\min}_{\beta\in\mathbb{R}^{p}}\Big\{\frac{1}{2n}\|y-\!X\beta\|^{2}+\lambda_{n}\|\beta\|_{1}\Big\},

\beta^{\rm sr}\in\mathop{\arg\min}_{\beta\in\mathbb{R}^{p}}\Big\{\frac{1}{\sqrt{n}}\|y-\!X\beta\|+\frac{\lambda^{\prime}}{n}\|\beta\|_{1}\Big\},

0\leq\epsilon<\frac{n\tau^{2}\kappa-54\tau^{2}s^{*}\|X\|_{\max}\|X\|_{1}}{27n\tau s^{*}}.

F^{k}:=\Big\{i\!:\big||\beta_{i}^{k}|-|\beta_{i}^{*}|\big|\geq\frac{1}{\rho_{k}}\Big\}\ \ {\rm and}\ \ \Lambda^{k}:=\Big\{i\!:|\beta_{i}^{*}|\leq\frac{4a}{(a\!+\!1)\rho_{k}}\Big\}.
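Two of the listed formulas lend themselves to a quick numerical check. The Python sketch below evaluates $\psi^{*}$ and $h_{\rho}$ as listed, verifies on a few sample points that $h_{\rho}(t)=\rho|t|-\psi^{*}(\rho|t|)$ (the bracketed term in $\Theta_{\nu,\rho}$), and implements the closed-form weight update $w_{i}^{k}$. The function names and sample values are illustrative assumptions, and the parameter $a$ is simply taken as a given constant with $a>1$, since the function $\phi$ it comes from is not shown in this list.

```python
def psi_star(s):
    """psi^*(s) = 0 if s <= 1, and s - 1 otherwise (as listed above)."""
    return 0.0 if s <= 1.0 else s - 1.0


def h_rho(t, rho):
    """h_rho(t) = rho*|t| if |t| <= 1/rho, and 1 otherwise (a capped l1 function)."""
    return rho * abs(t) if abs(t) <= 1.0 / rho else 1.0


def weight_update(beta, rho, a):
    """Closed-form weights: clip(((a+1)*rho*|beta_i| - 2) / (2*(a-1)), 0, 1).

    Assumes a > 1; the underlying phi is not reproduced here.
    """
    return [
        min(1.0, max(0.0, ((a + 1.0) * rho * abs(b) - 2.0) / (2.0 * (a - 1.0))))
        for b in beta
    ]


# Check h_rho(t) = rho*|t| - psi_star(rho*|t|) on sample points.
for rho in (0.5, 2.0, 10.0):
    for t in (-3.0, -0.5, 0.0, 0.05, 0.5, 4.0):
        lhs = h_rho(t, rho)
        rhs = rho * abs(t) - psi_star(rho * abs(t))
        assert abs(lhs - rhs) < 1e-12

# Small entries get weight 0 (full l1 penalty in the next stage),
# large entries get weight 1 (no penalty in the next stage).
weights = weight_update([0.0, 1.0, 2.0], rho=1.0, a=3.0)
```

Since the next-stage penalty on $\beta_{i}$ is $\lambda(1-w_{i}^{k})|\beta_{i}|$, a weight of 1 removes the penalty on that coordinate, while a weight of 0 keeps the full $\ell_{1}$ penalty.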


Taxonomy

Topics: Statistical Methods and Inference · Sparse and Compressive Sensing Techniques · Advanced Statistical Methods and Models

Full text

A proximal dual semismooth Newton method for computing zero-norm penalized QR estimator

Dongdong Zhang (School of Mathematics, SCUT, Guangzhou, China), Shaohua Pan (School of Mathematics, South China University of Technology, China), and Shujun Bi (School of Mathematics, South China University of Technology, China).

Abstract

This paper is concerned with the computation of the high-dimensional zero-norm penalized quantile regression estimator, defined as a global minimizer of the zero-norm penalized check loss function. To seek a desirable approximation to the estimator, we reformulate this NP-hard problem as an equivalent augmented Lipschitz optimization problem, and exploit its coupled structure to propose a multi-stage convex relaxation approach (MSCRA_PPA), each step of which solves inexactly a weighted $\ell_{1}$-regularized check loss minimization problem with a proximal dual semismooth Newton method. Under a restricted strong convexity condition, we provide the theoretical guarantee for the MSCRA_PPA by establishing the error bound of each iterate to the true estimator and the rate of linear convergence in a statistical sense. Numerical comparisons on some synthetic and real data show that MSCRA_PPA not only has comparable or even better estimation performance, but also requires much less CPU time.

Keywords: High-dimensional; Zero-norm penalized quantile regression; Variable selection; Proximal dual semismooth Newton method

1 Introduction

Sparse penalized regression has become a popular approach for high-dimensional data analysis. In the past two decades, many classes of sparse penalized regressions have been developed by imposing a suitable penalty term on the least squares loss, such as the bridge penalty in [14], Lasso in [37], SCAD in [10], elastic net in [45], adaptive Lasso in [46], and so on. We refer to the survey papers by [3] and [11] for the references. These penalties, as a convex surrogate (say, the $\ell_{1}$-norm) or a nonconvex approximation (say, the bridge penalty) to the zero-norm, essentially try to capture the performance of the zero-norm, first used in the best subset selection by [6]. The sparse least squares regression approach is useful, but it only focuses on the central tendency of the conditional distribution. It is known that a certain covariate may not have a significant influence on the mean value of the response but may have a strong effect on the upper quantile of the conditional distribution due to the heterogeneity of data. It is likely that a covariate has different effects at different segments of the conditional distribution. As illustrated by [19], for non-Gaussian error distributions, the least squares regression is substantially outperformed by the quantile regression (QR).

Inspired by this, many researchers have recently considered the QR introduced by [19] for high-dimensional data analysis, owing to its robustness to outliers and its ability to offer unique insights into the relation between the response variable and the covariates; see, e.g., [39, 1, 40, 41, 12, 13]. [1] focused on the theory of the $\ell_{1}$-penalized QR, showed that this estimator is consistent at the near-oracle rate, and provided the conditions under which the selected model includes the true model; [41] studied the $\ell_{1}$-penalized least absolute deviation (LAD) regression and verified that the estimator has near-oracle performance with a high probability; and [12] studied the weighted $\ell_{1}$-penalized QR and established the model selection oracle property and the asymptotic normality for this estimator. For nonconvex penalty-type QRs, [39] achieved, under mild conditions, the asymptotic oracle property of the SCAD and adaptive-Lasso penalized QRs, and [40] showed that with probability approaching one, the oracle estimator is a local optimal solution to the SCAD or MCP penalized QRs of ultra-high dimensionality. We notice that the above results are all established for the asymptotic case $n\to\infty$.

Besides the above theoretical works, there are some works concerned with the computation of (weighted) $\ell_{1}$-penalized QR estimators which, compared to the (weighted) $\ell_{1}$-least-squares estimator, requires more sophisticated algorithms due to the piecewise linearity of the check loss function. Although the $\ell_{1}$-penalized QR model can be transformed into a linear program (LP) by introducing additional variables and one may use interior point method (IPM) software such as SeDuMi in [34] to solve it, this is limited to the small or medium scale case; see Figures 1-2 in Section 5. Inspired by this, [38] proposed a greedy coordinate descent algorithm for the $\ell_{1}$-penalized LAD regression, [42] proposed a semismooth Newton coordinate descent algorithm for the elastic-net penalized QR, and [18] recently developed a semi-proximal alternating direction method of multipliers (sPADMM) and a combined version of ADMM and coordinate descent method (which is actually an inexact ADMM) for solving the weighted $\ell_{1}$-penalized QR. In addition, for nonconvex penalized QRs, [27] developed an iterative coordinate descent algorithm and established the convergence of any subsequence to a stationary point, and [13] provided a systematic study for folded concave penalized regressions, including the SCAD and MCP penalized QRs as special cases, and showed that with high probability the oracle estimator can be obtained within two iterations of the local linear approximation (LLA) approach proposed by [47]. We find that [27] and [13] did not establish the error bound of the iterates to the true solution.

This work is concerned with the computation of the high-dimensional zero-norm penalized QR estimator, a global minimizer of the zero-norm regularized check loss. To seek a high-quality approximation to this estimator, we reformulate this NP-hard problem as a mathematical program with an equilibrium constraint (MPEC), and obtain an equivalent augmented Lipschitz optimization problem from the global exact penalty of the MPEC. This augmented problem not only has a favorable coupled structure but also implies an equivalent DC (difference of convex) surrogate for the zero-norm regularized check loss minimization; see Section 2. By solving the augmented Lipschitz problem in an alternating way, we propose in Section 3 a multi-stage convex relaxation approach (MSCRA) to compute a desirable surrogate for the zero-norm penalized QR estimator. Similar to the LLA method of [47], the MSCRA solves in each step a weighted $\ell_{1}$-regularized check loss minimization, but the subproblems are allowed to be solved inexactly. Under a mild restricted strong convexity condition, we provide its theoretical guarantee in Section 4 by establishing the error bound of each iterate to the true estimator and the rate of linear convergence in a statistical sense.

Motivated by the recent work [35], we also develop a proximal dual semismooth Newton method (PDSN) in Section 5 for solving the subproblems involved in the MSCRA. Different from the semismooth Newton method by [42], this is a proximal point algorithm (PPA) with the subproblems solved by applying the semismooth Newton method to their duals, rather than to a smooth approximation to the elastic-net penalized check loss minimization problem. Numerical comparisons are made on some synthetic and real data for MSCRA_PPA, MSCRA_IPM and MSCRA_ADMM, which are the MSCRA with the subproblems solved by PDSN, SeDuMi in [34] and semi-proximal ADMM in [18], respectively. We find that MSCRA_IPM and MSCRA_ADMM have very similar performance, while MSCRA_PPA not only has estimation performance comparable to that of the two methods but also requires only one-fifteenth of the CPU time required by MSCRA_ADMM and MSCRA_IPM.

Throughout this paper, $I$ and $e$ denote an identity matrix and a vector of all ones, whose dimensions are known from the context. For an $x\in\mathbb{R}^{p}$, write $|x|:=(|x_{1}|,\ldots,|x_{p}|)^{\mathbb{T}}$ and ${\rm sign}(x):=({\rm sign}(x_{1}),\ldots,{\rm sign}(x_{p}))^{\mathbb{T}}$, and denote by $\|x\|_{1},\|x\|$ and $\|x\|_{\infty}$ the $\ell_{1}$-norm, $\ell_{2}$-norm and $\ell_{\infty}$-norm of $x$, respectively. For a matrix $A\in\mathbb{R}^{n\times p}$, $\|A\|,\|A\|_{\max}$ and $\|A\|_{1}$ respectively denote the spectral norm, element-wise maximum norm, and maximum column sum norm of $A$. For a set $S$, $\mathbb{I}_{S}$ denotes the characteristic function of $S$, i.e., $\mathbb{I}_{S}(z)=1$ if $z\in S$, otherwise $\mathbb{I}_{S}(z)=0$. For given $a,b\in\mathbb{R}^{p}$ with $a_{i}\leq b_{i}$ for $i=1,\ldots,p$, $[a,b]$ denotes the box set $\{x\in\mathbb{R}^{p}\ |\ a_{i}\leq x_{i}\leq b_{i},\ i=1,\ldots,p\}$. For an extended real-valued function $f\!:\mathbb{R}^{p}\to(-\infty,+\infty]$, write ${\rm dom}\,f:=\{x\in\mathbb{R}^{p}\ |\ f(x)<\infty\}$, and for a given $\gamma>0$ denote by $\mathcal{P}_{\gamma}f$ and $e_{\gamma}f$ the proximal mapping and Moreau envelope of $f$, defined as $\mathcal{P}_{\gamma}f(x):=\mathop{\arg\min}_{z\in\mathbb{R}^{p}}\big\{f(z)+\frac{1}{2\gamma}\|z-x\|^{2}\big\}$ and $e_{\gamma}f(x):=\min_{z\in\mathbb{R}^{p}}\big\{f(z)+\frac{1}{2\gamma}\|z-x\|^{2}\big\}$. In the sequel, we write $\mathcal{P}\!f$ for $\mathcal{P}_{1}f$. When $f$ is convex, $\mathcal{P}_{\gamma}f\!:\mathbb{R}^{p}\to\mathbb{R}^{p}$ is a Lipschitz mapping with modulus $1$, and $e_{\gamma}f$ is a smooth convex function with $\nabla e_{\gamma}f(x)=\gamma^{-1}(x-\mathcal{P}_{\gamma}f(x))$.
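As a concrete illustration of the proximal mapping and the Moreau envelope, the following Python sketch takes the scalar function $f(z)=\lambda|z|$ (an illustrative choice, not one singled out in the text), for which $\mathcal{P}_{\gamma}f$ is the soft-thresholding operator. It checks the gradient formula $\nabla e_{\gamma}f(x)=\gamma^{-1}(x-\mathcal{P}_{\gamma}f(x))$ against a central finite difference, and the Lipschitz property with modulus $1$ on a few points. The function names and parameter values are assumptions made for this example.

```python
def prox_abs(x, gamma, lam):
    """Proximal mapping P_gamma f(x) for f(z) = lam * |z| (soft-thresholding)."""
    thr = gamma * lam
    if x > thr:
        return x - thr
    if x < -thr:
        return x + thr
    return 0.0


def moreau_abs(x, gamma, lam):
    """Moreau envelope e_gamma f(x) = f(p) + (p - x)^2 / (2*gamma), p = P_gamma f(x)."""
    p = prox_abs(x, gamma, lam)
    return lam * abs(p) + (p - x) ** 2 / (2.0 * gamma)


def moreau_grad(x, gamma, lam):
    """Gradient formula from the text: (x - P_gamma f(x)) / gamma."""
    return (x - prox_abs(x, gamma, lam)) / gamma


gamma, lam, step = 0.5, 2.0, 1e-6
points = (-3.0, -0.3, 0.3, 3.0)

# Gradient formula versus a central finite difference of the envelope.
for x in points:
    fd = (moreau_abs(x + step, gamma, lam) - moreau_abs(x - step, gamma, lam)) / (2.0 * step)
    assert abs(fd - moreau_grad(x, gamma, lam)) < 1e-5

# Lipschitz continuity of the proximal mapping with modulus 1.
for a in points:
    for b in points:
        assert abs(prox_abs(a, gamma, lam) - prox_abs(b, gamma, lam)) <= abs(a - b)
```

For this $f$ the envelope is the Huber function: quadratic on $|x|\leq\gamma\lambda$ and linear outside, which is why it is smooth although $f$ is not.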

2 Zero-norm penalized quantile regression and equivalent difference of convex model

Quantile regression is a popular method for studying the influence of a set of covariates on the conditional distribution of a response variable, and has been widely used to handle heteroscedasticity; see [20] and [40]. For a univariate response ${\bf Y}$ and a vector of covariates ${\bf X}\in\mathbb{R}^{p}$, the conditional cumulative distribution function of ${\bf Y}$ is defined as $F_{\bf Y}(t|x):={\rm Pr}({\bf Y}\leq t\ |\ {\bf X}=x)$, and the $\tau$th conditional quantile of ${\bf Y}$ is given by $Q_{\bf Y}(\tau|x):=\inf\big\{t\!:F_{\bf Y}(t|x)\geq\tau\big\}.$ Let $X\!=[x_{1}\ \cdots\ x_{n}]^{\mathbb{T}}$ be an $n\times p$ design matrix on ${\bf X}$. Consider the linear quantile regression

$$y=X\beta^{*}+\varepsilon, \tag{1}$$

where $y=(y_{1},\ldots,y_{n})^{\mathbb{T}}\in\mathbb{R}^{n}$ is the response vector, $\varepsilon=(\varepsilon_{1},\ldots,\varepsilon_{n})^{\mathbb{T}}$ is the noise vector whose components are independently distributed and satisfy ${\rm Pr}(\varepsilon_{i}\leq 0|x_{i})=\tau$ for some known constant $\tau\in(0,1)$, and $\beta^{*}\in\mathbb{R}^{p}$ is the true but unknown coefficient vector. This quantile regression model actually assumes that $Q_{\bf Y}(\tau|x_{i})=x_{i}^{\mathbb{T}}\beta^{*}$ for $i=1,\ldots,n$. We are interested in the high-dimensional case where $p>n$ and in the sparse model in the sense that only $s^{*}(\ll p)$ components of the unknown true $\beta^{*}$ are nonzero.
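The definition of the $\tau$th quantile as $\inf\{t\!:F(t)\geq\tau\}$ and the noise condition ${\rm Pr}(\varepsilon_{i}\leq 0|x_{i})=\tau$ can be illustrated on a finite sample, with the empirical distribution function standing in for $F_{\bf Y}$. The following Python sketch is only an illustration under that assumption; the helper names and the toy data are hypothetical.

```python
def empirical_quantile(sample, tau):
    """inf{t : F_n(t) >= tau}, where F_n is the empirical CDF of the sample."""
    s = sorted(sample)
    n = len(s)
    for k, t in enumerate(s, start=1):
        # At the k-th order statistic the empirical CDF is at least k / n.
        if k / n >= tau:
            return t
    return s[-1]


def center_at_quantile(noise, tau):
    """Shift the noise so that its empirical tau-th quantile equals zero."""
    q = empirical_quantile(noise, tau)
    return [e - q for e in noise]


noise = [4.0, 1.0, 3.0, 2.0]
tau = 0.5
eps = center_at_quantile(noise, tau)

# After centering, the fraction of nonpositive noise entries is at least tau,
# which is the sample analogue of Pr(eps_i <= 0 | x_i) = tau.
frac_nonpositive = sum(1 for e in eps if e <= 0) / len(eps)
```

Changing `tau` moves the point at which the noise is centered, which is how the same linear model can describe different quantiles of the response.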

For $\tau\in\!(0,1)$ , let $f_{\tau}\!:\mathbb{R}^{n}\to\mathbb{R}$ be the check loss function of (1), i.e.,

[TABLE]

which was first introduced by [19]. To estimate the unknown true $\beta^{*}$ in (1), we consider the zero-norm regularized problem

[TABLE]

where $\nu>0$ is the regularization parameter, and $\|\beta\|_{0}$ denotes the zero-norm of $\beta$ (i.e., the number of nonzero entries of $\beta$ ). By the expression of $f_{\tau}$ , $f_{\tau}$ is nonnegative and coercive (i.e., $f_{\tau}(\beta^{k})\to+\infty$ whenever $\|\beta^{k}\|\to\infty$ ). By Lemma 3 in Appendix A, the estimator $\widehat{\beta}(\tau)$ is well defined. Since $\widehat{\beta}(\tau)$ depends on $\tau$ , model (3) can monitor different “locations” of the conditional distribution, and the heteroscedasticity of the data, when present, can be inspected by solving (3) with different $\tau\in(0,1)$ . For simplicity, in the sequel we write $\widehat{\beta}$ for $\widehat{\beta}(\tau)$ , and for a given $\tau\in(0,1)$ , write $\underline{\tau}:=\min(\tau,1\!-\!\tau)$ and $\overline{\tau}:=\max(\tau,1\!-\!\tau)$ .
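To make the role of $\tau$ concrete, the sketch below assumes the usual averaged check loss $f_{\tau}(z)=\frac{1}{n}\sum_{i=1}^{n}(\tau-\mathbb{I}_{\{z_{i}\leq 0\}})z_{i}$ (the $1/n$ scaling is an assumption of this sketch) and minimizes $b\mapsto f_{\tau}(y-be)$ over a scalar $b$. Since this map is convex and piecewise linear with kinks at the data points, a minimizer can be found among the $y_{i}$, and it is a sample $\tau$-quantile. The data and the function names are illustrative.

```python
def check_loss(z, tau):
    """Averaged check loss; the 1/n scaling is an assumption of this sketch."""
    return sum((tau - (1.0 if zi <= 0 else 0.0)) * zi for zi in z) / len(z)


def best_constant(y, tau):
    """Minimize b -> check_loss(y - b) over the data points.

    The objective is convex and piecewise linear in b with kinks at the
    entries of y, so it suffices to search over those entries.
    """
    return min(y, key=lambda b: check_loss([yi - b for yi in y], tau))


# Eleven illustrative observations (an odd count keeps the minimizers unique here).
y = [0.3, -1.2, 2.5, 0.9, 4.1, -0.4, 1.6, 3.3, 0.1, 2.0, 5.0]

q25 = best_constant(y, 0.25)
q50 = best_constant(y, 0.50)
q75 = best_constant(y, 0.75)
print(q25, q50, q75)
```

Varying $\tau$ moves the fitted constant through the sorted sample, which is the one-dimensional analogue of monitoring different locations of the conditional distribution.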

Owing to the combinatorial nature of the zero-norm, the computation of $\widehat{\beta}$ is NP-hard. To design an algorithm in the next section for seeking a high-quality approximation to $\widehat{\beta}$ , we next derive an equivalent augmented Lipschitz optimization problem from a primal-dual viewpoint. To demonstrate that this mechanism provides a unified way to yield equivalent DC surrogates for the zero-norm regularized problem (3), we introduce a family of proper lsc convex functions on $\mathbb{R}$ , denoted by $\mathscr{L}$ , satisfying the conditions:

[TABLE]

With a $\phi\in\!\mathscr{L}$ , clearly, the zero-norm $\|z\|_{0}$ is the optimal value function of

[TABLE]

This characterization of zero-norm shows that model (3) is equivalent to

[TABLE]

in the following sense: if $\overline{\beta}$ is globally optimal to (3), then $(\overline{\beta}\!,{\rm sign}(|\overline{\beta}|))$ is a global optimal solution of problem (5), and conversely, if $(\overline{\beta},\overline{w})$ is a global optimal solution of (5), then $\overline{\beta}$ is globally optimal to (3). Problem (5) is a mathematical program with an equilibrium constraint $e-w\geq 0,|\beta|\geq 0$ , $\langle e-w,|\beta|\rangle=0$ (abbreviated as MPEC). The equivalence between (3) and (5) shows that the difficulty of model (3) arises from the hidden equilibrium constraint. It is well known that the handling of nonconvex constraints is much harder than that of nonconvex objective functions. Then it is natural to consider the penalized version of problem (5)

[TABLE]

where $\rho>0$ is the penalty parameter. Since $\beta\mapsto\!f_{\tau}(y-\!X\beta)$ is Lipschitz continuous, the following conclusion holds by Section 3.2 of [23].

Theorem 2.1

Problem (6) associated with each $\rho>\overline{\rho}:=\frac{\phi_{-}^{\prime}(1)(1-t^{*})\overline{\tau}\nu\|X\|}{1-t_{0}}$ has the same global optimal solution set as the MPEC (5) does, where $t_{0}$ is the minimum element in $[t^{*},1)$ such that $\frac{1}{1-t^{*}}\in\partial\phi(t_{0})$ .

Theorem 2.1 states that problem (6) is a global exact penalty of (5) in the sense that there is a threshold $\overline{\rho}>0$ such that the former associated with every $\rho>\overline{\rho}$ has the same global optimal solution set as the latter does. Together with the equivalence between (3) and (5), model (3) is equivalent to problem (6). Notice that the objective function of (6) is globally Lipschitz continuous over its feasible set and its nonconvexity arises from the coupled term $\langle e\!-\!w,|\beta|\rangle$ rather than from the combinatorial zero-norm. So, problem (6) provides an equivalent augmented Lipschitz reformulation for the zero-norm problem (3). In fact, problem (6) associated with every $\rho>\overline{\rho}$ yields an equivalent DC surrogate for (3). To illustrate this, let $\psi(t)=\phi(t)$ if $t\in[0,1]$ and otherwise $\psi(t)=+\infty$ . Then, with the conjugate $\psi^{*}(s):=\sup_{t\in\mathbb{R}}\{st-\psi(t)\}$ of $\psi$ , one may check that (6) is equivalent to

[TABLE]

Since $\psi^{*}$ is a nondecreasing finite convex function on $\mathbb{R}$ , the function $s\mapsto\psi^{*}(\rho|s|)$ is convex, and problem (7) is a DC program. To sum up the above discussions, problem (7) associated to every $\rho>\overline{\rho}$ provides an equivalent DC surrogate for (3). Moreover, $H_{\rho}(\beta):=\sum_{i=1}^{p}h_{\rho}(\beta_{i})$ with $h_{\rho}(t):=\rho|t|-\psi^{*}(\rho|t|)$ for $t\in\mathbb{R}$ is a DC surrogate for the zero-norm. To close this section, we present some examples of $\phi\in\mathscr{L}$ .

Example 2.1

Let $\phi(t)=t$ for $t\in\mathbb{R}$ . After a simple computation, we have

[TABLE]

It is immediate to see that the function $\nu^{-1}h_{\rho}(t)$ will reduce to the capped $\ell_{1}$ -function $t\mapsto\lambda\min(|t|,\alpha)$ in [44] with $\nu=\rho/\lambda$ and $\rho=\alpha^{-1}$ .
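The reduction to the capped $\ell_{1}$ function can be checked numerically. The sketch below uses the conjugate $\psi^{*}(s)=\max(s-1,0)$, which follows from $\psi(t)=t$ on $[0,1]$ by maximizing the linear function $t\mapsto(s-1)t$ over $[0,1]$. It compares this closed form with a brute-force supremum on a grid, and then compares $\nu^{-1}h_{\rho}$ with $\lambda\min(|t|,\alpha)$ for $\nu=\rho/\lambda$ and $\rho=\alpha^{-1}$. The parameter values and function names are illustrative assumptions.

```python
def psi_conj_bruteforce(s, grid=2001):
    """sup over t in [0, 1] of s*t - phi(t) with phi(t) = t, on a uniform grid."""
    return max(s * (k / (grid - 1)) - k / (grid - 1) for k in range(grid))


def h_rho(t, rho):
    """h_rho(t) = rho|t| - psi*(rho|t|), using psi*(s) = max(s - 1, 0)."""
    s = rho * abs(t)
    return s - max(s - 1.0, 0.0)


def capped_l1(t, lam, alpha):
    """Capped l1 function t -> lam * min(|t|, alpha)."""
    return lam * min(abs(t), alpha)


lam, alpha = 0.7, 2.5          # illustrative values
rho = 1.0 / alpha              # rho = alpha^{-1}
nu = rho / lam                 # nu = rho / lambda

# Closed-form conjugate versus brute force.
gap_conj = max(
    abs(psi_conj_bruteforce(s) - max(s - 1.0, 0.0)) for s in [0.0, 0.5, 1.0, 1.7, 4.0]
)

# nu^{-1} h_rho versus the capped l1 function.
ts = [-6.0, -2.5, -1.0, 0.0, 0.4, 2.5, 3.0, 10.0]
gap_capped = max(abs(h_rho(t, rho) / nu - capped_l1(t, lam, alpha)) for t in ts)
print(gap_conj, gap_capped)
```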

Example 2.2

Let $\phi(t):=\frac{a-1}{a+1}t^{2}+\frac{2}{a+1}t\ (a>1)$ for $t\in\mathbb{R}$ . One can calculate

[TABLE]

It is not hard to check that $\nu^{-1}h_{\rho}(t)$ reduces to the SCAD function $\rho_{\lambda}(t)$ in [10] when $\nu=\frac{2}{(a+1)\lambda^{2}}$ and $\rho=\frac{2}{(a+1)\lambda}$ .
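This reduction can also be verified numerically. In the sketch below, $\psi^{*}$ is evaluated by plugging in the maximizer of $st-\phi(t)$ over $[0,1]$, which is the stationary point $\frac{(a+1)s-2}{2(a-1)}$ clipped to $[0,1]$ (a closed form derived here from the definition of $\phi$). The result is compared with SCAD written in its usual three-piece form. The values of $a$ and $\lambda$ and all function names are illustrative assumptions.

```python
def phi(t, a):
    """phi(t) = (a-1)/(a+1) * t^2 + 2/(a+1) * t."""
    return (a - 1.0) / (a + 1.0) * t * t + 2.0 / (a + 1.0) * t


def psi_conj(s, a):
    """Conjugate of psi = phi + indicator of [0, 1], via the clipped stationary point."""
    t = min(1.0, max(0.0, ((a + 1.0) * s - 2.0) / (2.0 * (a - 1.0))))
    return s * t - phi(t, a)


def h_rho(t, rho, a):
    """h_rho(t) = rho|t| - psi*(rho|t|)."""
    s = rho * abs(t)
    return s - psi_conj(s, a)


def scad(t, lam, a):
    """SCAD penalty in its usual three-piece form."""
    x = abs(t)
    if x <= lam:
        return lam * x
    if x <= a * lam:
        return (2.0 * a * lam * x - x * x - lam * lam) / (2.0 * (a - 1.0))
    return (a + 1.0) * lam * lam / 2.0


a, lam = 3.7, 0.8                       # illustrative values
nu = 2.0 / ((a + 1.0) * lam ** 2)
rho = 2.0 / ((a + 1.0) * lam)

ts = [k * 0.05 - 5.0 for k in range(201)]
gap_scad = max(abs(h_rho(t, rho, a) / nu - scad(t, lam, a)) for t in ts)
print(gap_scad)
```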

Example 2.3

Let $\phi(t):=\frac{a^{2}}{4}t^{2}-\frac{a^{2}}{2}t+at+\frac{(a-2)^{2}}{4}\ (a>2)$ for $t\in\mathbb{R}$ . We have

[TABLE]

The function $\nu^{-1}h_{\rho}(t)$ reduces to the MCP in [43] if $\nu=\frac{2}{a\lambda^{2}},\rho=\frac{1}{\lambda}$ .

3 Multi-stage convex relaxation approach

From the last section, to compute the estimator $\widehat{\beta}$ , we only need to solve a single penalty problem (6) that is much easier than the zero-norm problem (3) because its nonconvexity only arises from the coupled term $\langle w,|\beta|\rangle$ . Observe that (6) becomes a convex program when either of $w$ and $\beta$ is fixed. So, we solve it in an alternating way and propose the following multi-stage convex relaxation approach (MSCRA) with $\phi$ in Example 2.2.

Remark 3.1

(i) Step 1 of Algorithm 1 solves problem (6) with $w$ fixed to be $w^{k-1}$ , while Step 3 solves this problem with $\beta$ fixed to be $\beta^{k}$ ; that is, Algorithm 1 solves the nonconvex penalty problem (6) in an alternating way. In the first stage, since there is no information on the nonzero entries of $\beta^{*}$ , it is reasonable to impose an unbiased weight on each component of $\beta$ . Motivated by this, we restrict the initial $w^{0}$ in $[0,0.5e]$ , a subset of the feasible set of $w$ . When $w^{0}=0$ , the first stage is precisely the minimization of the $\ell_{1}$ -penalized check loss function. Although the threshold $\overline{\rho}$ is known when the parameter $\nu$ in (3) is given, we select a varying $\rho$ for (17) since it is just a relaxation of (6).

(ii) By the optimality condition of (17), $\rho_{k}|\beta_{i}^{k}|\in\partial\psi(w_{i}^{k})$ for each $i$ , which by Theorem 23.5 in [31] and (11) is equivalent to saying

[TABLE]

Clearly, when $\rho_{k}|\beta_{i}^{k}|$ is close to $0$ , $(1\!-\!w_{i}^{k})$ in (18) may not equal $1$ though close to $1$ ; when $\rho_{k}|\beta_{i}^{k}|$ is very large, $(1\!-\!w_{i}^{k})$ in (18) may not equal $0$ though close to $0$ . To achieve a high-quality solution with Algorithm 1, the last term of (16) implies that a smaller $(1\!-\!w_{i}^{k-1})$ but not $0$ is expected for those larger $|\beta_{i}|$ , and a larger $(1\!-\!w_{i}^{k-1})$ instead of $1$ is expected for those smaller $|\beta_{i}|$ . Thus, the function $\phi$ in Example 2.2 is desirable especially for those problems whose solutions have small nonzero entries. The weight $w^{k}$ associated with the function $\phi$ in Example 2.3 has a similar performance, but the weight $w^{k}$ associated with the function $\phi$ in Example 2.1 is different since $w_{i}^{k}=0$ if $\rho_{k}|\beta_{i}^{k}|<1$ , $w_{i}^{k}=1$ if $\rho_{k}|\beta_{i}^{k}|>1$ , otherwise $w_{i}^{k}\in[0,1]$ .

(iii) Algorithm 1 is actually an inexact majorization-minimization (MM) method (see [22]) for solving the equivalent DC surrogate (7) with a special starting point. Indeed, for a given $\beta^{\prime}\in\mathbb{R}^{p}$ , the convexity and smoothness of $\psi^{*}$ imply that with $w_{i}=(\psi^{*})^{\prime}(\rho|\beta_{i}^{\prime}|)$ for $i=1,\ldots,p$ ,

[TABLE]

Notice that each $w_{i}\in[0,1]$ by the expression of $\psi^{*}$ . Hence, the function

[TABLE]

is a majorization of $\Theta_{\lambda,\rho}$ at $\beta^{k-1}$ and the subproblem (16) is the inexact minimization of this majorization function. Also, for any given $\rho_{0}>0$ , when $\|\beta^{0}\|_{\infty}\leq\frac{2}{(a+1)\rho_{0}}$ , we have $w_{i}^{0}=(\psi^{*})^{\prime}(\rho_{0}|\beta_{i}^{0}|)=0$ by (11). Thus, the first stage of Algorithm 1 with $w^{0}=0$ is precisely the inexact MM method for (7) with $\beta^{0}$ satisfying $\|\beta^{0}\|_{\infty}\leq\frac{2}{(a+1)\rho_{0}}$ . In addition, Algorithm 1 can be regarded as an inexact version of the LLA method proposed by [47] for (7), but it is different from the DC algorithm by [39] since the latter depends on the majorization of $\beta\mapsto{\textstyle\sum_{i=1}^{p}}\psi^{*}(\rho|\beta_{i}|)$ at $\beta^{k}$ and the obtained approximation lacks symmetry.

(iv) Considering that practical computation always involves errors, we allow the problem in (16) to be solved inexactly with the accuracy measured in the following way: $\exists\delta^{k}\in\mathbb{R}^{p}$ and $r_{k}\geq 0$ with $\|\delta^{k}\|\leq r_{k}$ such that

[TABLE]

where the equality is by Theorem 23.8 in [31]. Notice that the first-order optimality conditions of (6) take the following form

[TABLE]

where $u\in\mathbb{R}^{n}$ is the Lagrange multiplier associated to $y-X\beta-z=0$ . By Step 2 of Algorithm 1, $\rho_{k}|\beta^{k}|\in\partial\psi(w_{1}^{k})\times\cdots\times\partial\psi(w_{p}^{k})$ . In view of this, we measure the KKT residual of (6) associated to $\rho_{k}$ at $(\beta^{k},z^{k},u^{k})$ by

[TABLE]

where $\Delta_{1}^{k}:=z^{k}-\mathcal{P}\!f_{\tau}(z^{k}+u^{k})$ and $\Delta_{2}^{k}:=X^{\mathbb{T}}u^{k}-\mathcal{P}h_{k}(X^{\mathbb{T}}u^{k}+\beta^{k})$ with

[TABLE]
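Parts (ii) and (iii) of Remark 3.1 can be illustrated for the function $\phi$ of Example 2.2. The sketch below uses the closed form of $(\psi^{*})^{\prime}$ obtained by maximizing $st-\phi(t)$ over $[0,1]$ (derived here from the definitions, so it should be read as an assumption of this sketch rather than as a copy of the displayed formulas). It computes the weights and checks, on a grid, the majorization $h_{\rho}(t)\leq h_{\rho}(t^{\prime})+\rho(1-w)(|t|-|t^{\prime}|)$ with $w=(\psi^{*})^{\prime}(\rho|t^{\prime}|)$, which follows from the convexity of $\psi^{*}$. The values of $a$ and $\rho$ are illustrative.

```python
def weight(beta_i, rho, a):
    """w = (psi*)'(rho|beta_i|): the stationary point of s*w - phi(w), clipped to [0, 1]."""
    s = rho * abs(beta_i)
    return min(1.0, max(0.0, ((a + 1.0) * s - 2.0) / (2.0 * (a - 1.0))))


def h_rho(t, rho, a):
    """h_rho(t) = rho|t| - psi*(rho|t|), with psi* evaluated at its maximizer."""
    s = rho * abs(t)
    w = weight(t, rho, a)
    phi_w = (a - 1.0) / (a + 1.0) * w * w + 2.0 / (a + 1.0) * w
    return s - (s * w - phi_w)


def majorant(t, t_ref, rho, a):
    """Weighted l1 majorant of h_rho built at the reference point t_ref."""
    w = weight(t_ref, rho, a)
    return h_rho(t_ref, rho, a) + rho * (1.0 - w) * (abs(t) - abs(t_ref))


a, rho = 3.7, 1.0                         # illustrative values

# Small entries get weight 0 (full l1 penalty), large entries get weight 1 (no penalty).
w_small = weight(0.1, rho, a)
w_mid = weight(1.0, rho, a)
w_large = weight(2.0, rho, a)

grid = [k * 0.1 - 3.0 for k in range(61)]
min_slack = min(majorant(t, t_ref, rho, a) - h_rho(t, rho, a) for t in grid for t_ref in grid)
print(w_small, w_mid, w_large, min_slack)
```

The weighted $\ell_{1}$ term $\rho(1-w)|t|$ is the only part of the majorant that depends on $t$, which is why each stage reduces to a weighted $\ell_{1}$-regularized convex problem.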

4 Theoretical guarantees of Algorithm 1

We denote by $S^{*}$ the support of the true vector $\beta^{*}$ , and define the set

[TABLE]

The matrix $X$ is said to have the $\kappa$ -restricted strong convexity on $\mathcal{C}(S^{*})$ if

[TABLE]

The RSC is equivalent to the restricted eigenvalue condition of the Gram matrix $\frac{1}{2n}X^{\mathbb{T}}X$ due to [16] and [4]. Notice that $\mathcal{C}(S^{*})\supseteq\big{\{}\beta\in\mathbb{R}^{p}\!:\|\beta_{(S^{*})^{c}}\|_{1}\leq 3\|\beta_{S^{*}}\|_{1}\big{\}}$ . This RSC is a little stronger than the one used by [26] for the $\ell_{1}$ -regularized smooth loss minimization. In this section, we shall provide the deterministic theoretical guarantees for Algorithm 1 under this RSC, including the error bound of the iterate $\beta^{k}$ to the true $\beta^{*}$ and the decrease analysis of the error sequence. The proofs are all included in Appendix B. We need the following assumption on the optimality tolerance $r_{k}$ of $\beta^{k}$ :

Assumption 4.1

There exists $\epsilon>0$ such that for each $k\in\mathbb{N}$ , $r_{k}\leq\epsilon$ .

First, by Lemma 7.4 in Appendix B, we have the following error bound.

Theorem 4.1

Suppose that Assumption 4.1 holds, that $X$ has the $\kappa$ -RSC over $\mathcal{C}(S^{*})$ , and that the noise vector $\varepsilon$ is nonzero. If $\rho_{3}$ and $\lambda$ are chosen such that $\rho_{3}\leq\frac{8}{9\sqrt{3}c\overline{\tau}\lambda\|\varepsilon\|_{\infty}}$ and $\lambda\in\Big{[}\frac{16\overline{\tau}\|X\|_{1}}{n}+8\epsilon,\frac{\underline{\tau}^{2}\kappa-c^{-1}-3\overline{\tau}\|X\|_{\rm max}(2n^{-1}\overline{\tau}\|X\|_{1}+\epsilon)s^{*}}{3\overline{\tau}\|X\|_{\rm max}s^{*}}\Big{]}$ for some constant

[TABLE]

then for every $k\in\mathbb{N}$

[TABLE]

Remark 4.1

(i) For the $\ell_{1}$ -regularized least squares smooth loss estimator

[TABLE]

the error bound $\|\beta^{\rm LS}-\beta^{*}\|=O(\sigma\sqrt{s^{*}\log p/n})$ was obtained in Corollary 2 of [26] by taking $\lambda_{n}=\sqrt{\log p/n}$ , where $\sigma>0$ represents the variance of the noise. Compared with this bound, the error bound in Theorem 4.1 involves the infinity norm $\|\varepsilon\|_{\infty}$ of the noise $\varepsilon$ rather than its variance, and moreover, it still has the same order $O(\sqrt{s^{*}\log p/n})$ when the parameter $\lambda=O(1)$ in our model is rescaled to be $\lambda_{n}$ .

(ii) For the following $\ell_{1}$ -regularized square-root nonsmooth loss estimator

[TABLE]

the error bound $\|\beta^{\rm sr}\!-\!\beta^{*}\|=O\big{(}\frac{\sigma\sqrt{s^{*}}\lambda^{\prime}\varpi}{n}\big{)}$ with $\varpi\geq\frac{1}{\sqrt{n}}\|\varepsilon\|$ was achieved in Theorem 1 of [2] by setting $\lambda^{\prime}=O(n)$ . By considering that $f_{\tau}(y-X\beta)=O(\sqrt{n}\|y-X\beta\|)$ , the parameter $\lambda$ in our model corresponds to $\lambda^{\prime}/n$ . Thus, the error bound in Theorem 4.1 corresponds to $O(\frac{\sqrt{s^{*}}\lambda^{\prime}\|\varepsilon\|_{\infty}}{n})$ , which has the same order as $O\big{(}\frac{\sigma\sqrt{s^{*}}\lambda^{\prime}\varpi}{n}\big{)}$ since $\|\varepsilon\|_{\infty}=O(\frac{1}{\sqrt{n}}\|\varepsilon\|)$ .

(iii) To ensure that the constant $c>0$ exists, the constant $\kappa$ needs to satisfy $\kappa>\frac{54\overline{\tau}^{2}s^{*}\|X\|_{\rm max}\|X\|_{1}}{n\underline{\tau}^{2}}$ and the inexact accuracy $\epsilon$ of $\beta^{k}$ needs to satisfy

[TABLE]

Since $\|X\|_{1}=O(n)$ , it is necessary to solve the subproblem (16) with a very small inexact accuracy $\epsilon$ .

Theorem 4.1 establishes an error bound for every iterate $\beta^{k}$ , but it does not tell us if the error bound of the current $\beta^{k}$ is better than that of the previous $\beta^{k-1}$ . In order to seek the answer, we study the decrease of the error bound sequence by bounding $\max_{i\in S^{*}}(1-w_{i}^{k})$ . For this purpose, write $F^{0}:=S^{*}$ and $\Lambda^{0}:=\{i\!:|\beta_{i}^{*}|\leq\frac{4a}{(a+1)\rho_{0}}\}$ , and for each $k\in\mathbb{N}$ define

[TABLE]

From Lemma 7.6 in Appendix B, the value $\max_{i\in S^{*}}(1-w_{i}^{k})$ is upper bounded by

[TABLE]

By this, we have the following conclusion.

Theorem 4.2

Suppose that Assumption 4.1 holds, that $X$ has the $\kappa$ -RSC over $\mathcal{C}(S^{*})$ , and that the noise $\varepsilon$ is nonzero. If $\lambda$ is chosen as in Theorem 4.1 and the parameter $\rho_{3}$ satisfies $\rho_{3}\leq\frac{1}{c\overline{\tau}\lambda\|\varepsilon\|_{\infty}(\sqrt{4.5s^{*}}+\!\sqrt{3}/8)}$ , then for each $k\in\mathbb{N}$

[TABLE]

where we stipulate that $\sum_{j=0}^{k-2}r_{k-j}(\frac{1}{\sqrt{3}})^{j}=0$ for $k=1$ .

Remark 4.2

(i) The error bound in (4.2) consists of the statistical error due to the noise, the identification error $\max_{i\in S^{*}}\mathbb{I}_{\Lambda^{0}}(i)$ related to the choice of $a$ and $\rho_{0}$ , and the computation errors $\sum_{j=0}^{k-2}r_{k-j}(\frac{1}{\sqrt{3}})^{j}$ and $(\frac{1}{\sqrt{3}})^{k-1}\|\beta^{1}\!-\beta^{*}\|$ . By the definition of $\Lambda^{0}$ , when $\rho_{0}$ and $a$ are such that $\frac{(a+1)\rho_{0}}{4a}>\frac{1}{\min_{i\in S^{*}}\!|\beta_{i}^{*}|}$ , the identification error becomes zero. If $\min_{i\in S^{*}}\!|\beta_{i}^{*}|$ is not too small, it would be easy to choose such $\rho_{0}$ . Clearly, when $\rho_{0}$ and $a$ are chosen to be larger, the identification error is smaller. However, when $\rho_{0}$ and $a$ are larger, $\rho_{1}$ becomes larger and each component of $w^{1}$ is close to $1$ by (18). Consequently, it will become very conservative to cut those smaller entries of $\beta^{2}$ when solving the second subproblem. Hence, there is a trade-off between the choice of $a$ and $\rho_{0}$ and the computation speed of Algorithm 1.

(ii) If the subproblem (16) could be solved exactly, the computation error $\sum_{j=0}^{k-2}r_{k-j}(\frac{1}{\sqrt{3}})^{j}$ vanishes. If the subproblem (16) is solved with the accuracy $r_{k}$ satisfying $r_{k}\leq(\frac{1}{\sqrt{3}})^{k}\frac{1}{k^{\nu}}$ for $\nu>1$ , this computation error will tend to $0$ as $k\to+\infty$ . Since the third term on the right hand side of (4.2) is the combination of the noise and $\sum_{j=0}^{k-2}r_{k-j}(\frac{1}{\sqrt{3}})^{j}$ , it is strongly suggested that the subproblem (16) be solved as accurately as possible.
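The decay of the computation error in Remark 4.2 (ii) can be seen on a small numerical example. The sketch below takes $r_{k}=(\frac{1}{\sqrt{3}})^{k}k^{-\nu}$ with $\nu=2$ (the largest tolerance allowed by the stated condition, with the exponent chosen for illustration; here $\nu$ is the exponent of the remark, not the regularization parameter) and evaluates $\sum_{j=0}^{k-2}r_{k-j}(\frac{1}{\sqrt{3}})^{j}$. For this choice the sum equals $(\frac{1}{\sqrt{3}})^{k}\sum_{m=2}^{k}m^{-\nu}$, which tends to $0$ geometrically because the inner sum stays bounded. The function names are hypothetical.

```python
import math


def tolerance(k, nu=2.0):
    """Illustrative accuracy r_k = (1/sqrt(3))^k / k^nu."""
    return (1.0 / math.sqrt(3.0)) ** k / k ** nu


def computation_error(k, nu=2.0):
    """sum_{j=0}^{k-2} r_{k-j} (1/sqrt(3))^j, with the empty sum for k = 1 equal to 0."""
    q = 1.0 / math.sqrt(3.0)
    return sum(tolerance(k - j, nu) * q ** j for j in range(0, k - 1))


errors = {k: computation_error(k) for k in range(1, 31)}
for k in (1, 2, 5, 10, 20, 30):
    print(k, errors[k])
```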

For the RSC assumption in Theorems 4.1 and 4.2, from [30] we know that if $X$ is from the $\Sigma_{x}$ -Gaussian ensemble (i.e., $X$ is formed by independently sampling each row $x_{i}^{\mathbb{T}}\sim N(0,\Sigma_{x})$ ), there exists a constant $\kappa>0$ (depending on $\Sigma_{x}$ ) such that the RSC holds on $\mathcal{C}(S^{*})$ with probability greater than $1\!-c_{1}\exp(-c_{2}n)$ as long as $n>c_{0}s^{*}\log p$ , where $c_{0},c_{1}$ and $c_{2}$ are absolute positive constants. From [5], for some sub-Gaussian $X$ , the RSC holds on $\mathcal{C}(S^{*})$ with a high probability when $n$ is over a threshold depending on the Gaussian width of $\mathcal{C}(S^{*})$ .

5 Proximal dual semismooth Newton method

By Remark 3.1 (iv), the pivotal part of Algorithm 1 is the exact solution of

[TABLE]

where, for each $k\in\mathbb{N}$ , $h_{k}$ is the function defined in (22). In this section, we develop a proximal dual semismooth Newton method (PDSN) for (26), which is a proximal point algorithm (PPA) with the subproblems solved by applying the semismooth Newton method to their dual problems.

Remark 5.1

(i) Since $f_{\tau}(y\!-\!X\cdot)$ and $h_{k-1}$ are convex but nondifferentiable, we follow the same line as in [35] to introduce a key proximal term $\frac{\gamma_{2,j}}{2}\|X\beta-\!X\beta^{j}\|^{2}$ in addition to the common $\frac{\gamma_{1,j}}{2}\|\beta-\beta^{j}\|^{2}$ . As will be shown later, this provides an effective way to handle the nonsmooth $f_{\tau}(y-\!X\cdot)$ .

(ii) The first-order optimality conditions for (26) have the following form $u\in\partial\!f_{\tau}(z),\,X^{\mathbb{T}}u+\delta^{k}\in\partial h_{k-1}(\beta),\,y-\!X\beta-\!z=0,$ where $u\in\mathbb{R}^{n}$ is the multiplier vector associated to $y-X\beta-z=0$ . Hence, the KKT residual of problem (26) at $(\beta^{j},z^{j},u^{j})$ can be measured by

[TABLE]

So, we suggest ${\bf Err}_{\rm PPA}^{j}\!\leq\epsilon_{\rm PPA}^{j}$ as the stopping condition of Algorithm 2.

The efficiency of Algorithm 2 depends on the solution of its subproblem, which by introducing a variable $z\in\mathbb{R}^{n}$ is equivalently written as

[TABLE]

After an elementary calculation, the dual of the above subproblem takes the following form

[TABLE]

Since $\Psi_{k,j}$ is a smooth convex function, seeking an optimal solution of the last dual problem is equivalent to finding a root to the system

[TABLE]

Since $\mathcal{P}_{\gamma_{2,j}^{-1}}f_{\tau}$ and $\mathcal{P}_{\gamma_{1,j}^{-1}}h_{k-1}$ are strongly semismooth by Appendix A and the composition of strongly semismooth mappings is strongly semismooth by [9], the mapping $\Phi_{k,j}$ is strongly semismooth. Inspired by this, we use the semismooth Newton method to seek a root to system (28), which by [28] is expected to have a superlinear or even quadratic convergence rate. By Proposition 2.3.3 and Theorem 2.6.6 of [8], the Clarke Jacobian $\partial_{C}\Phi_{k,j}(u)$ of $\Phi_{k,j}$ at $u$ is included in

[TABLE]

where (5) is due to Lemma 7.1-7.2 in Appendix A, and $\mathcal{U}_{j}(u)$ and $\mathcal{V}_{j}(u)$ are

[TABLE]

For each $U^{j}\!\in\mathcal{U}_{j}(u)$ and $V^{j}\!\in\mathcal{V}_{j}(u)$ , the matrix $\gamma_{2,j}^{-1}U^{j}+\!\gamma_{1,j}^{-1}XV^{j}X^{\mathbb{T}}$ is positive semidefinite, and positive definite when $\{i\ |\ \frac{\tau-1}{n\gamma}\!\leq z_{i}^{j}-\gamma_{2,j}^{-1}u_{i}\leq\!\frac{\tau}{n\gamma}\}=\emptyset$ or the matrix $X_{J}$ has full row rank with $J=\!\{i\ |\ |(\gamma_{1,j}\beta^{j}-X^{\mathbb{T}}u-\delta^{k})_{i}|>\omega_{i}^{k}\}$ . To ensure that each iteration of the semismooth Newton method is well defined, i.e., that each element of the Clarke Jacobian $\partial_{C}\Phi_{k,j}(u)$ is nonsingular, we add a small positive definite perturbation $\mu I$ to $\gamma_{2,j}^{-1}U^{j}+\!\gamma_{1,j}^{-1}XV^{j}X^{\mathbb{T}}$ . The detailed iterates of the semismooth Newton method are provided in Appendix C.
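The index sets above come from the piecewise linear structure of the two proximal mappings. As a minimal sketch for the check loss part, assume that $f_{\tau}$ is the averaged check loss, so that $\mathcal{P}_{\gamma^{-1}}f_{\tau}$ acts componentwise as the proximal mapping of $c\,\theta_{\tau}$ with $c=\frac{1}{n\gamma}$ and $\theta_{\tau}(u)=(\tau-\mathbb{I}_{\{u\leq 0\}})u$ (the scaling, the symbol $c$ and the function names are assumptions of this sketch). The code gives the closed form, one admissible generalized derivative (equal to $0$ on the interval $[c(\tau-1),c\tau]$ and to $1$ elsewhere), and a brute-force check of the closed form on a grid.

```python
def prox_check(v, tau, c):
    """Proximal point of c * theta_tau at v, theta_tau(u) = (tau - 1{u <= 0}) * u."""
    if v > c * tau:
        return v - c * tau
    if v < c * (tau - 1.0):
        return v - c * (tau - 1.0)
    return 0.0


def jacobian_entry(v, tau, c):
    """One admissible generalized derivative of prox_check at v (a diagonal entry)."""
    return 0.0 if c * (tau - 1.0) <= v <= c * tau else 1.0


def prox_bruteforce(v, tau, c, half_width=5.0, steps=20001):
    """Grid minimizer of z -> c * theta_tau(z) + 0.5 * (z - v)^2."""
    best_z, best_val = 0.0, float("inf")
    for k in range(steps):
        z = -half_width + 2.0 * half_width * k / (steps - 1)
        val = c * (tau - (1.0 if z <= 0 else 0.0)) * z + 0.5 * (z - v) ** 2
        if val < best_val:
            best_z, best_val = z, val
    return best_z


n, gamma, tau = 4, 0.5, 0.3      # illustrative values
c = 1.0 / (n * gamma)            # flat piece of the prox is [c*(tau-1), c*tau] = [-0.35, 0.15]

vs = [-2.0, -0.35, -0.1, 0.0, 0.1, 0.15, 1.3]
gap_prox = max(abs(prox_check(v, tau, c) - prox_bruteforce(v, tau, c)) for v in vs)
diag = [jacobian_entry(v, tau, c) for v in vs]
print(gap_prox, diag)
```

Entries with derivative $0$ contribute nothing to the diagonal part of the Newton matrix, which is consistent with the role of the perturbation $\mu I$ described above.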

6 Numerical experiments

We test the performance of Algorithm 1 with the subproblems solved by PDSN, SeDuMi and sPADMM, respectively, on synthetic and real data, and call the three solvers MSCRA_PPA, MSCRA_IPM and MSCRA_ADMM, respectively. Here, SeDuMi solves the equivalent LP of (16):

[TABLE]

and the iterates of sPADMM are described in Appendix C. All numerical results are computed on a laptop running 64-bit Windows with an Intel(R) Core(TM) i7-8565 CPU 1.8GHz and 8 GB RAM.

For SeDuMi, we adopt the default setting, and for sPADMM we choose the step-size $\varrho=1.618$ and the initial $\sigma=1$ , and adopt the stopping criterion in Appendix C with $j_{\rm max}=3000$ and $\epsilon_{\rm ADMM}=10^{-6}$ . For PDSN, we choose $\underline{\gamma}=10^{-8},\varrho=5/7$ and $\gamma_{1,0}=\gamma_{2,0}=\min(0.1,R_{0})$ where $R_{0}$ is the relative KKT residual at the initial $(\beta^{0},z^{0},u^{0})$ , and adopt the stopping criterion in Remark 5.1(ii) with $\epsilon_{\rm PPA}^{j+1}=\max(10^{-8},0.1\epsilon_{\rm PPA}^{j})$ for $\epsilon_{\rm PPA}^{0}\!=10^{-6}$ and the stopping rule $\frac{\|\Phi_{k,j}(u^{l})\|}{1+\|y\|}\leq 0.1\epsilon_{\rm PPA}^{j}$ for Algorithm 1 in Appendix C.

For MSCRA_IPM, MSCRA_ADMM and MSCRA_PPA, we use $w^{0}=0$ , and terminate them at $\beta^{k}$ when $k>10$ , or $N_{\rm nz}(\beta^{k})=\cdots=N_{\rm nz}(\beta^{k-3})$ and ${\bf Err}_{k}\leq 10^{-5}$ , or $N_{\rm nz}(\beta^{k})=\cdots=N_{\rm nz}(\beta^{k-2})$ and $|{\bf Err}_{k}-{\bf Err}_{k-2}|\leq 10^{-6}$ , where $N_{\rm nz}(\beta^{k})\!:=\!\sum_{i=1}^{p}\mathbb{I}\big{\{}|\beta_{i}^{k}|>\!10^{-6}\max(1,\|\beta^{k}\|_{\infty})\big{\}}$ denotes the number of nonzero entries of $\beta^{k}$ , and ${\bf Err}_{k}$ is the KKT residual at the $k$ th step defined in (21). We update $\rho_{k}$ by $\rho_{1}=\max\big{(}1,\frac{1}{3\|\beta^{1}\|_{\infty}}\big{)}$ and $\rho_{k}=\min\big{(}\frac{5}{4}\rho_{k-1},\frac{10^{8}}{\|\beta^{k}\|_{\infty}}\big{)}$ for $k=2,3$ . In addition, during the implementation of the three solvers, we run SeDuMi, sPADMM and PDSN to solve the $k$ th subproblem with the optimal solution of the $(k\!-\!1)$ th subproblem yielded by them as the starting point. When $k=1$ , we choose $\beta^{0}=0$ to be the starting point of MSCRA_IPM and MSCRA_ADMM, and use $\beta^{0}=0$ to run Algorithm 2.
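The nonzero count and the update of $\rho_{k}$ described above are simple to state in code. The following sketch follows the two formulas in the text; keeping $\rho_{k}$ unchanged for $k>3$ and rejecting a zero iterate are assumptions of this sketch, and the function names are hypothetical.

```python
def sup_norm(beta):
    """Infinity norm of a list of floats."""
    return max(abs(b) for b in beta)


def count_nonzeros(beta, rel_tol=1e-6):
    """N_nz: number of entries with |beta_i| > rel_tol * max(1, ||beta||_inf)."""
    threshold = rel_tol * max(1.0, sup_norm(beta))
    return sum(1 for b in beta if abs(b) > threshold)


def update_rho(k, rho_prev, beta):
    """rho_1 = max(1, 1/(3||beta||_inf)); rho_k = min(1.25*rho_{k-1}, 1e8/||beta||_inf) for k = 2, 3."""
    norm = sup_norm(beta)
    if norm == 0.0:
        raise ValueError("beta must be nonzero for the rho update")
    if k == 1:
        return max(1.0, 1.0 / (3.0 * norm))
    if k in (2, 3):
        return min(1.25 * rho_prev, 1e8 / norm)
    return rho_prev  # assumption: rho is kept fixed after the third stage


beta = [0.0, 2.0, 1e-7, -0.5, 3e-6]
n_nz = count_nonzeros(beta)              # the threshold here is 2e-6
rho_1 = update_rho(1, None, [0.1, -0.05])
rho_2 = update_rho(2, rho_1, [0.1, -0.05])
print(n_nz, rho_1, rho_2)
```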

6.1. Comparisons of three solvers for the subproblem

We make numerical comparisons among SeDuMi, sPADMM and PDSN by applying them to the problem (16) for $k=1$ , i.e., the $\ell_{1}$ -regularized check loss minimization problem. Inspired by [18], we consider the simulation model $y_{i}=x_{i}^{\mathbb{T}}\beta^{*}+\kappa\varepsilon_{i}$ for $i=1,\ldots,n$ in [15] to generate data, where $x_{i}^{\mathbb{T}}\sim N(0,\Sigma)$ for $i=1,\ldots,n$ with $\Sigma=(\alpha+(1-\!\alpha)\mathbb{I}_{\{i=j\}})_{p\times p},\beta_{j}^{*}=\!(-1)^{j}\exp(-\frac{2j-1}{20})$ , $\varepsilon\sim N(0,\Sigma)$ , and $\kappa$ is chosen such that the signal-noise ratio of the data is $3.0$ . We focus on the high-dimensional situation with $(p,n)=(5000,500)$ and $\alpha=0$ and $0.95$ . Figures 1 and 2 show the optimal values yielded by the three solvers and their CPU time (in seconds) on solving (16) with $k=1$ and the same sequence of $50$ values of $\lambda$ . By the results in Section 4, we select the $50$ values of $\lambda$ by

[TABLE]

for $i=1,2,\ldots,50$ , where $\gamma_{\rm min}=0.02$ , and $\gamma_{\rm max}=0.25$ and $0.38$ respectively for $\alpha=0$ and $0.95$ . This $\gamma_{\rm max}$ is chosen so that $N_{\rm nz}(\beta^{f})$ attains the value $0$ , where $\beta^{f}$ represents the final output of a solver.
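The design and the true coefficient vector of this simulation model can be generated without any matrix factorization. The sketch below uses the fact that a vector with entries $\sqrt{\alpha}\,g_{0}+\sqrt{1-\alpha}\,g_{j}$ , with $g_{0},g_{1},\ldots,g_{p}$ independent standard normal, has covariance $\Sigma=(\alpha+(1-\alpha)\mathbb{I}_{\{i=j\}})_{p\times p}$ . The noise scaling $\kappa$ is not reproduced because the text does not spell out the signal-noise ratio formula. The small sizes, the seed and the function names are illustrative assumptions.

```python
import math
import random


def true_beta(p):
    """beta*_j = (-1)^j * exp(-(2j - 1)/20) for j = 1, ..., p."""
    return [(-1.0) ** j * math.exp(-(2 * j - 1) / 20.0) for j in range(1, p + 1)]


def equicorrelated_row(p, alpha, rng):
    """One draw from N(0, Sigma) with unit variances and pairwise correlation alpha."""
    shared = rng.gauss(0.0, 1.0)
    return [
        math.sqrt(alpha) * shared + math.sqrt(1.0 - alpha) * rng.gauss(0.0, 1.0)
        for _ in range(p)
    ]


def sample_correlation(rows, i, j):
    """Empirical correlation between columns i and j of a list of rows."""
    n = len(rows)
    mean_i = sum(r[i] for r in rows) / n
    mean_j = sum(r[j] for r in rows) / n
    cov = sum((r[i] - mean_i) * (r[j] - mean_j) for r in rows)
    var_i = sum((r[i] - mean_i) ** 2 for r in rows)
    var_j = sum((r[j] - mean_j) ** 2 for r in rows)
    return cov / math.sqrt(var_i * var_j)


rng = random.Random(0)                      # fixed seed for reproducibility
rows = [equicorrelated_row(3, 0.95, rng) for _ in range(4000)]
corr_01 = sample_correlation(rows, 0, 1)    # should be close to alpha = 0.95
beta_star = true_beta(5)
print(corr_01, beta_star)
```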

Figure 1 shows that the three solvers yield comparable optimal values, and the optimal values given by PDSN are a little better than those given by SeDuMi and sPADMM. Figure 2 shows that PDSN requires much less CPU time than SeDuMi and sPADMM do, and for $\alpha=0.95$ the CPU time of the former is on average about $0.03$ and $0.09$ times that of SeDuMi and sPADMM, respectively, but for $\alpha=0,\tau=0.5$ , when $\lambda<\lambda_{3}$ , PDSN requires more CPU time since the Clarke Jacobians are close to singularity. This shows that if the parameter $\lambda$ in the model is not too small (a common setting for sparsity), PDSN is superior to SeDuMi and sPADMM in terms of the optimal value and CPU time. We find that sPADMM always attains the maximum number of iterations $3000$ for all test problems (it even attains the maximum number of iterations if $j_{\rm max}=10000$ ). Since $j_{\rm max}=3000$ is used here, its CPU time is less than that of SeDuMi.

6.2. Numerical performance of Algorithm 1

We first apply MSCRA_PPA to the example in Section 3.1 of [40], i.e., solve (6) with $\nu=\lambda^{-1}$ for $\lambda=\max(0.01,0.1\|X\|_{1}/n)$ , for which the scalar response is generated according to the heteroscedastic location-scale model $Y=X_{6}+X_{12}+X_{15}+X_{20}+0.7X_{1}\varepsilon$ , where $\varepsilon\sim N(0,1)$ is independent of the covariates. Table 1 reports its identification performance for $\tau=0.3,0.5$ and $0.7$ under different sample sizes, where Size, AE, $P_{1}$ and $P_{2}$ have the same meaning as in [40]. We see that, for $\tau=0.5$ , $P_{2}$ always equals $0$ . So, the check loss with $\tau=0.5$ cannot identify $X_{1}$ , but the check loss with $\tau=0.3$ and $0.7$ can identify $X_{1}$ and the proportion of identifying $X_{1}$ increases as $n$ becomes large.

Next we use a synthetic example to show that MSCRA_PPA can solve efficiently a series of zero-norm regularized problems (3) with different $\tau$ but a fixed $\lambda$ . We generate an i.i.d. standard normal random vector $\beta_{S^{*}}^{*}$ with $s^{*}=\lfloor 0.5\sqrt{p}\rfloor$ entries of $S^{*}$ chosen randomly from $\{1,\ldots,p\}$ for $p=15000$ , and then obtain the response vector $y$ from model (1), where $x_{i}^{\mathbb{T}}\sim N(0,\Sigma)$ for $i=1,\ldots,n$ with $\Sigma=0.6E+0.4I$ and $n=\lfloor 2s^{*}\log p\rfloor$ , and the noise $\varepsilon_{i}$ is from the Laplace distribution with density $d(u)=0.5\exp(-|u|)$ . Here, $E$ is a $p\times p$ matrix of all ones. Figure 3 describes the average absolute $\ell_{2}$ -error $\|\widehat{\beta}^{f}\!-\!\beta^{*}\|$ and time when applying MSCRA_PPA to $10$ test problems for $\tau\in\{0.05,0.1,0.15,\ldots,0.95\}$ with $\nu=\lambda^{-1}$ and $\lambda=37.5/n$ . We see that MSCRA_PPA yields better $\ell_{2}$ -errors for $\tau$ close to $0.5$ , and worse $\ell_{2}$ -errors for $\tau$ close to $0$ or $1$ . So, for this class of noises, the check loss with $\tau$ close to $0.5$ is suitable. The MSCRA_PPA yields a desired solution for all test problems in $40$ seconds, and the CPU time for $\tau$ close to $0$ or $1$ is about $1.5$ times that of $\tau$ close to $0.5$ . This means that it is an efficient solver for a series of zero-norm regularized problems in (3).
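For orientation, the problem sizes implied by these formulas and a way to draw the Laplace noise can be sketched as follows. The natural logarithm is assumed for $\log p$ , and the noise is drawn as the difference of two independent unit exponentials, which has density $0.5\exp(-|u|)$ . The seed, the sample size used for the sanity check and the function name are illustrative assumptions.

```python
import math
import random

p = 15000
s_star = math.floor(0.5 * math.sqrt(p))        # sparsity level
n = math.floor(2 * s_star * math.log(p))       # sample size (natural log assumed)
lam = 37.5 / n                                 # lambda used for all tau
taus = [round(0.05 * k, 2) for k in range(1, 20)]


def laplace_noise(rng):
    """Draw from the density 0.5*exp(-|u|) as a difference of two unit exponentials."""
    return rng.expovariate(1.0) - rng.expovariate(1.0)


rng = random.Random(0)                         # fixed seed for reproducibility
noise = [laplace_noise(rng) for _ in range(20000)]
mean_abs = sum(abs(e) for e in noise) / len(noise)        # E|eps| = 1 for this density
frac_nonpos = sum(1 for e in noise if e <= 0) / len(noise)
print(s_star, n, lam, len(taus), mean_abs, frac_nonpos)
```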

7 Conclusions

We have proposed a multi-stage convex relaxation approach, MSCRA_PPA, for computing a desirable approximation to the zero-norm penalized QR, which is defined as a global minimizer of an NP-hard problem. Under the common RSC condition and a mild restriction on the noises, we established the error bound of every iterate to the true estimator and the linear rate of convergence of the iterate sequence in a statistical sense. Numerical comparisons with MSCRA_IPM and MSCRA_ADMM show that MSCRA_PPA yields a comparable estimation performance within much less time.

Supplementary Materials

The online supplementary material consists of five parts. Appendix A includes some preliminary knowledge on generalized subdifferentials and Clarke Jacobian, and some lemmas used in Sections 2-5; Appendix B includes the proofs of Theorem 4.1 and Theorem 4.2; Appendix C introduces the semismooth Newton method and the semi-proximal ADMM in [17]; Appendix D includes performance comparisons of MSCRA_IPM, MSCRA_ADMM and MSCRA_PPA on some synthetic data and real data.

Acknowledgements

The authors would like to give their sincere thanks to two anonymous reviewers for their helpful comments. The authors would like to express their sincere thanks to Professor Kim-Chuan Toh from National University of Singapore for giving them some help on the implementation of Algorithm 2 when he visited SCUT. This work is supported by the National Natural Science Foundation of China under project No. 11971177.

Bibliography

[1] A. Belloni and V. Chernozhukov, $\ell_{1}$-penalized quantile regression in high-dimensional sparse models, The Annals of Statistics, 39(2011): 82-130.
[2] A. Belloni, V. Chernozhukov and L. Wang, Square-root lasso: pivotal recovery of sparse signals via conic programming, Biometrika, 4(2011): 791-806.
[3] P. Bickel and B. Li, Regularization in Statistics, Sociedad de Estadística e Investigación Operativa Test, 15(2006): 271-344.
[4] P. Bickel, Y. Ritov and A. Tsybakov, Simultaneous analysis of lasso and dantzig selector, The Annals of Statistics, 37(2009): 1705-1732.
[5] A. Banerjee, S. Chen, F. Fazayeli and V. Sivakumar, Estimation with norm regularization, Advances in Neural Information Processing Systems, 2(2015): 1556-1564.
[6] L. Breiman, Heuristics of instability and stabilization in model selection, The Annals of Statistics, 24(1996): 2350-2383.
[7] A. P. Chiang, Homozygosity mapping with SNP arrays identifies TRIM32, an E3 ubiquitin ligase, as a Bardet-Biedl syndrome gene (BBS11), Proceedings of the National Academy of Sciences, 103(2006): 6287-6292.
[8] F. H. Clarke, Nonsmooth Analysis and Optimization, Wiley, New York, 1983.