A proximal dual semismooth Newton method for computing zero-norm penalized QR estimator
Dongdong Zhang, Shaohua Pan, Shujun Bi

TL;DR
This paper introduces a novel multi-stage convex relaxation method using a proximal dual semismooth Newton approach to efficiently compute high-dimensional zero-norm penalized quantile regression estimators, with theoretical guarantees and superior empirical performance.
Contribution
It develops a new multi-stage convex relaxation algorithm with a proximal dual semismooth Newton method for zero-norm penalized QR, providing theoretical error bounds and convergence analysis.
Findings
Achieves linear convergence rate under restricted strong convexity.
Outperforms existing methods in estimation accuracy and computational efficiency.
Demonstrates effectiveness on synthetic and real datasets.

| | | | | |
|---|---|---|---|---|
| Size | 11.800(4.369) | 9.320(3.146) | 6.290(1.472) | 5.330(0.697) |
| | 0.81 | 0.83 | 0.93 | 0.91 |
| | 0.81 | 0.83 | 0.93 | 0.91 |
| AE | 0.197(0.174) | 0.170(0.165) | 0.176(0.155) | 0.145(0.127) |
| Size | 10.960(3.075) | 7.910(2.060) | 5.270(1.171) | 4.370(0.597) |
| | 1.00 | 1.00 | 1.00 | 1.00 |
| | 0.00 | 0.00 | 0.00 | 0.00 |
| AE | 0.034(0.014) | 0.027(0.011) | 0.021(0.010) | 0.018(0.008) |
| Size | 12.590(4.356) | 8.320(2.169) | 6.310(1.308) | 5.380(0.693) |
| | 0.79 | 0.88 | 0.91 | 0.93 |
| | 0.79 | 0.88 | 0.91 | 0.93 |
| AE | 0.183(0.175) | 0.220(0.180) | 0.151(0.146) | 0.162(0.142) |


| | Method | | -error | FP | FN | Time(s) | | -error | FP | FN | Time(s) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | IPM | 0.104 | 0.444(0.107) | 5.100(2.057) | 0.730(0.468) | 4.221 | 0.110 | 0.523(0.157) | 7.840(3.034) | 0.670(0.514) | 5.613 |
| | ADMM | 0.104 | 0.446(0.106) | 5.100(2.028) | 0.730(0.468) | 3.033 | 0.110 | 0.523(0.158) | 7.760(3.079) | 0.670(0.514) | 3.847 |
| | PPA | 0.116 | 0.446(0.119) | 1.920(1.228) | 0.800(0.426) | 0.138 | 0.119 | 0.557(0.188) | 3.810(1.937) | 0.840(0.420) | 0.202 |
| | IPM | 0.104 | 0.345(0.066) | 5.030(2.007) | 0.410(0.494) | 3.566 | 0.110 | 0.377(0.078) | 6.860(2.741) | 0.490(0.502) | 4.168 |
| | ADMM | 0.104 | 0.345(0.067) | 5.150(2.110) | 0.410(0.494) | 2.601 | 0.110 | 0.377(0.078) | 6.890(2.723) | 0.480(0.502) | 3.062 |
| | PPA | 0.110 | 0.347(0.066) | 3.260(1.779) | 0.510(0.502) | 0.131 | 0.116 | 0.375(0.061) | 5.050(2.333) | 0.590(0.494) | 0.191 |
| | IPM | 0.104 | 1.425(0.361) | 6.750(2.955) | 1.860(0.921) | 5.558 | 0.122 | 1.764(0.501) | 4.220(2.377) | 2.660(1.085) | 5.568 |
| | ADMM | 0.104 | 1.427(0.356) | 6.760(3.114) | 1.880(0.902) | 3.829 | 0.122 | 1.749(0.512) | 4.270(2.432) | 2.670(1.064) | 3.825 |
| | PPA | 0.116 | 1.347(0.343) | 2.480(1.823) | 2.320(0.994) | 0.133 | 0.134 | 1.742(0.537) | 1.790(1.690) | 3.260(1.050) | 0.151 |
| Laplace | IPM | 0.098 | 0.324(0.071) | 7.410(2.775) | 0.220(0.416) | 3.835 | 0.110 | 0.364(0.089) | 6.550(2.484) | 0.410(0.494) | 3.789 |
| | ADMM | 0.098 | 0.324(0.070) | 7.450(2.797) | 0.220(0.416) | 2.709 | 0.110 | 0.365(0.089) | 6.580(2.458) | 0.400(0.492) | 2.761 |
| | PPA | 0.104 | 0.326(0.073) | 4.700(2.209) | 0.280(0.451) | 0.144 | 0.116 | 0.382(0.094) | 4.970(2.158) | 0.480(0.502) | 0.204 |
| | IPM | 0.104 | 0.487(0.139) | 5.330(2.301) | 0.760(0.474) | 4.677 | 0.110 | 0.649(0.238) | 7.300(2.880) | 0.840(0.507) | 4.907 |
| | ADMM | 0.104 | 0.487(0.138) | 5.360(2.325) | 0.760(0.474) | 3.214 | 0.110 | 0.647(0.239) | 7.360(2.812) | 0.840(0.507) | 3.340 |
| | PPA | 0.110 | 0.502(0.180) | 3.160(1.587) | 0.790(0.478) | 0.157 | 0.122 | 0.684(0.286) | 2.970(1.861) | 1.010(0.643) | 0.239 |
| Cauchy | IPM | 0.098 | 0.536(0.217) | 8.340(3.019) | 0.670(0.533) | 4.954 | 0.110 | 0.730(0.364) | 6.740(2.493) | 1.000(0.765) | 5.488 |
| | ADMM | 0.098 | 0.531(0.216) | 8.340(2.879) | 0.680(0.530) | 2.989 | 0.110 | 0.729(0.360) | 6.720(2.551) | 1.010(0.759) | 3.404 |
| | PPA | 0.116 | 0.560(0.274) | 1.780(1.203) | 0.910(0.637) | 0.166 | 0.125 | 0.816(0.381) | 2.760(1.837) | 1.280(0.792) | 0.243 |

| | Method | | -error | FP | FN | Time(s) | | -error | FP | FN | Time(s) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | IPM | 0.104 | 0.467(0.119) | 4.650(2.148) | 0.710(0.456) | 3.744 | 0.110 | 0.609(0.222) | 6.830(2.843) | 0.800(0.512) | 4.312 |
| | ADMM | 0.104 | 0.474(0.120) | 4.620(2.112) | 0.730(0.446) | 2.553 | 0.110 | 0.606(0.214) | 6.860(2.853) | 0.800(0.512) | 3.143 |
| | PPA | 0.110 | 0.491(0.145) | 2.810(1.594) | 0.760(0.474) | 0.133 | 0.122 | 0.591(0.199) | 3.020(1.664) | 0.870(0.442) | 0.201 |
| | IPM | 0.098 | 0.365(0.074) | 7.020(2.515) | 0.410(0.494) | 3.661 | 0.110 | 0.399(0.076) | 6.450(2.679) | 0.570(0.498) | 3.729 |
| | ADMM | 0.098 | 0.367(0.073) | 7.070(2.536) | 0.400(0.492) | 2.746 | 0.110 | 0.399(0.076) | 6.500(2.676) | 0.570(0.498) | 2.819 |
| | PPA | 0.098 | 0.366(0.073) | 7.060(2.566) | 0.410(0.494) | 0.139 | 0.122 | 0.423(0.127) | 3.390(1.959) | 0.630(0.485) | 0.180 |
| | IPM | 0.104 | 1.383(0.394) | 4.990(2.472) | 2.060(0.930) | 5.168 | 0.122 | 1.665(0.434) | 3.640(2.013) | 2.610(0.920) | 5.339 |
| | ADMM | 0.104 | 1.379(0.384) | 5.220(2.747) | 2.010(0.937) | 3.446 | 0.122 | 1.679(0.420) | 3.670(2.080) | 2.590(0.911) | 3.764 |
| | PPA | 0.119 | 1.365(0.420) | 1.590(1.436) | 2.490(0.937) | 0.101 | 0.131 | 1.705(0.512) | 2.100(1.755) | 3.010(0.959) | 0.167 |
| Laplace | IPM | 0.098 | 0.349(0.089) | 7.250(2.564) | 0.360(0.482) | 3.818 | 0.110 | 0.381(0.099) | 6.320(2.624) | 0.580(0.496) | 4.513 |
| | ADMM | 0.098 | 0.349(0.089) | 7.250(2.591) | 0.360(0.482) | 2.851 | 0.110 | 0.381(0.099) | 6.380(2.666) | 0.570(0.498) | 3.130 |
| | PPA | 0.104 | 0.352(0.088) | 4.600(2.079) | 0.410(0.494) | 0.125 | 0.116 | 0.408(0.154) | 4.610(2.188) | 0.480(0.522) | 0.209 |
| | IPM | 0.104 | 0.534(0.165) | 4.580(2.142) | 0.830(0.473) | 4.341 | 0.110 | 0.734(0.291) | 6.920(2.990) | 1.070(0.573) | 5.785 |
| | ADMM | 0.104 | 0.533(0.165) | 4.590(2.109) | 0.830(0.473) | 3.179 | 0.110 | 0.736(0.288) | 6.860(3.052) | 1.070(0.573) | 3.891 |
| | PPA | 0.110 | 0.542(0.180) | 3.020(1.723) | 0.860(0.472) | 0.129 | 0.122 | 0.710(0.283) | 3.240(1.782) | 1.150(0.575) | 0.209 |
| Cauchy | IPM | 0.101 | 0.544(0.245) | 6.130(2.232) | 0.820(0.539) | 4.912 | 0.104 | 0.695(0.343) | 9.450(3.105) | 0.980(0.681) | 5.948 |
| | ADMM | 0.104 | 0.538(0.258) | 4.890(2.136) | 0.860(0.513) | 2.952 | 0.104 | 0.693(0.335) | 9.530(2.883) | 0.950(0.672) | 3.686 |
| | PPA | 0.116 | 0.561(0.280) | 1.740(1.292) | 0.980(0.603) | 0.169 | 0.122 | 0.879(0.473) | 3.270(1.814) | 1.430(0.956) | 0.233 |

| | Method | | -error | FP | FN | Time(s) | | -error | FP | FN | Time(s) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | IPM | 0.095 | 0.852(0.361) | 7.050(2.504) | 1.260(0.733) | 4.117 | 0.098 | 0.986(0.408) | 10.740(3.852) | 1.400(0.804) | 6.170 |
| | ADMM | 0.092 | 0.835(0.336) | 8.800(2.723) | 1.240(0.698) | 3.306 | 0.098 | 0.996(0.404) | 10.940(3.961) | 1.400(0.816) | 4.721 |
| | PPA | 0.110 | 0.910(0.404) | 2.390(1.550) | 1.520(0.731) | 0.111 | 0.110 | 0.965(0.387) | 5.140(2.454) | 1.440(0.701) | 0.193 |
| | IPM | 0.098 | 0.530(0.208) | 5.300(2.368) | 0.780(0.504) | 3.683 | 0.098 | 0.622(0.254) | 9.510(4.036) | 0.850(0.557) | 5.205 |
| | ADMM | 0.092 | 0.519(0.184) | 8.460(2.844) | 0.770(0.489) | 2.933 | 0.098 | 0.625(0.261) | 9.630(4.099) | 0.850(0.557) | 3.851 |
| | PPA | 0.104 | 0.550(0.227) | 3.550(1.977) | 0.800(0.512) | 0.132 | 0.110 | 0.644(0.321) | 5.120(2.363) | 1.000(0.682) | 0.184 |
| | IPM | 0.104 | 1.742(0.616) | 4.350(2.086) | 2.590(0.889) | 4.362 | 0.122 | 2.113(0.641) | 3.120(1.981) | 3.020(0.995) | 5.187 |
| | ADMM | 0.104 | 1.713(0.642) | 4.560(2.203) | 2.500(0.959) | 3.187 | 0.116 | 2.139(0.629) | 4.230(2.155) | 2.970(0.958) | 4.269 |
| | PPA | 0.140 | 1.809(0.649) | 0.820(0.936) | 2.920(0.929) | 0.085 | 0.152 | 2.125(0.721) | 0.940(0.886) | 3.290(0.868) | 0.126 |
| Laplace | IPM | 0.098 | 0.520(0.257) | 5.810(2.639) | 0.720(0.637) | 3.767 | 0.104 | 0.650(0.375) | 6.980(3.291) | 0.980(0.710) | 3.990 |
| | ADMM | 0.098 | 0.510(0.242) | 5.880(2.626) | 0.710(0.608) | 2.864 | 0.104 | 0.645(0.370) | 7.140(3.333) | 0.970(0.703) | 3.180 |
| | PPA | 0.104 | 0.543(0.267) | 3.780(2.177) | 0.840(0.615) | 0.124 | 0.116 | 0.679(0.386) | 3.710(2.176) | 1.150(0.716) | 0.167 |
| | IPM | 0.095 | 0.955(0.412) | 7.180(2.754) | 1.470(0.658) | 4.517 | 0.098 | 1.135(0.465) | 10.250(4.029) | 1.660(0.831) | 5.201 |
| | ADMM | 0.092 | 0.934(0.407) | 8.700(3.125) | 1.410(0.653) | 3.236 | 0.098 | 1.135(0.485) | 10.400(3.929) | 1.660(0.867) | 3.641 |
| | PPA | 0.110 | 1.009(0.400) | 2.570(1.736) | 1.630(0.646) | 0.118 | 0.110 | 1.190(0.542) | 5.450(2.516) | 1.870(0.939) | 0.194 |
| Cauchy | IPM | 0.104 | 0.891(0.452) | 3.440(2.134) | 1.420(0.684) | 3.853 | 0.110 | 1.168(0.573) | 4.970(2.676) | 1.790(0.946) | 4.842 |
| | ADMM | 0.098 | 0.850(0.435) | 5.590(2.586) | 1.320(0.723) | 2.672 | 0.110 | 1.153(0.549) | 4.950(2.668) | 1.770(0.908) | 2.901 |
| | PPA | 0.116 | 0.962(0.452) | 1.380(1.237) | 1.570(0.700) | 0.157 | 0.122 | 1.138(0.570) | 2.920(1.895) | 1.800(0.921) | 0.205 |

| | Method | | -error | FP | FN | Time(s) | | -error | FP | FN | Time(s) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | IPM | 0.092 | 0.683(0.266) | 1.710(1.597) | 1.130(0.464) | 3.819 | 0.092 | 0.943(0.366) | 3.810(2.759) | 1.340(0.685) | 4.533 |
| | ADMM | 0.092 | 0.700(0.272) | 1.750(1.459) | 1.140(0.472) | 3.336 | 0.098 | 0.962(0.388) | 2.780(2.245) | 1.450(0.757) | 3.761 |
| | PPA | 0.104 | 0.744(0.282) | 0.650(0.880) | 1.260(0.543) | 0.195 | 0.116 | 0.934(0.347) | 1.020(1.163) | 1.580(0.684) | 0.227 |
| | IPM | 0.092 | 0.437(0.093) | 1.300(1.243) | 0.810(0.394) | 3.366 | 0.098 | 0.505(0.157) | 2.070(1.816) | 0.840(0.368) | 3.687 |
| | ADMM | 0.098 | 0.441(0.097) | 0.730(0.777) | 0.820(0.386) | 2.981 | 0.098 | 0.506(0.148) | 2.030(1.702) | 0.840(0.368) | 3.475 |
| | PPA | 0.104 | 0.448(0.107) | 0.350(0.557) | 0.930(0.293) | 0.178 | 0.116 | 0.523(0.192) | 0.420(0.867) | 1.020(0.200) | 0.235 |
| | IPM | 0.110 | 1.919(0.526) | 2.320(1.999) | 3.090(0.877) | 3.447 | 0.122 | 2.253(0.492) | 2.690(1.813) | 3.550(0.744) | 3.224 |
| | ADMM | 0.122 | 1.977(0.490) | 3.210(2.271) | 3.100(0.882) | 3.088 | 0.143 | 2.268(0.451) | 3.800(2.094) | 3.530(0.745) | 3.241 |
| | PPA | 0.152 | 2.016(0.545) | 1.650(1.480) | 3.410(0.866) | 0.117 | 0.155 | 2.444(0.579) | 2.600(1.717) | 3.830(0.842) | 0.170 |
| Laplace | IPM | 0.086 | 0.445(0.140) | 2.390(2.117) | 0.810(0.394) | 3.926 | 0.098 | 0.568(0.253) | 2.290(2.027) | 1.010(0.414) | 3.868 |
| | ADMM | 0.086 | 0.445(0.139) | 2.520(2.134) | 0.800(0.402) | 3.773 | 0.092 | 0.559(0.212) | 3.480(2.552) | 0.920(0.442) | 3.889 |
| | PPA | 0.098 | 0.469(0.167) | 0.930(1.380) | 0.910(0.379) | 0.181 | 0.104 | 0.586(0.279) | 1.570(2.171) | 1.110(0.510) | 0.250 |
| | IPM | 0.092 | 0.874(0.352) | 1.960(1.780) | 1.400(0.651) | 4.345 | 0.092 | 1.206(0.486) | 4.150(2.724) | 1.710(0.868) | 4.657 |
| | ADMM | 0.086 | 0.905(0.339) | 3.600(2.229) | 1.310(0.598) | 4.071 | 0.095 | 1.259(0.448) | 3.760(2.527) | 1.800(0.791) | 3.875 |
| | PPA | 0.110 | 0.966(0.347) | 0.910(1.215) | 1.610(0.680) | 0.165 | 0.116 | 1.172(0.429) | 1.290(1.241) | 1.980(0.816) | 0.216 |
| Cauchy | IPM | 0.086 | 0.803(0.377) | 3.050(2.208) | 1.330(0.620) | 5.123 | 0.092 | 1.239(0.575) | 3.910(2.016) | 1.900(0.859) | 5.142 |
| | ADMM | 0.092 | 0.896(0.436) | 2.270(1.869) | 1.480(0.674) | 3.599 | 0.095 | 1.392(0.592) | 4.190(2.608) | 2.040(0.887) | 3.471 |
| | PPA | 0.101 | 0.880(0.415) | 1.200(1.198) | 1.460(0.658) | 0.278 | 0.113 | 1.237(0.502) | 1.470(1.540) | 2.030(0.834) | 0.333 |

| | Method | | -error | FP | FN | Time(s) | | -error | FP | FN | Time(s) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | IPM | 0.092 | 1.572(0.411) | 1.020(1.263) | 2.630(0.761) | 2.879 | 0.098 | 1.803(0.469) | 1.480(1.337) | 2.890(0.840) | 2.907 |
| | ADMM | 0.131 | 1.683(0.365) | 2.050(1.617) | 2.820(0.796) | 2.979 | 0.116 | 1.923(0.462) | 3.050(2.057) | 2.950(0.903) | 3.077 |
| | PPA | 0.140 | 1.709(0.423) | 0.650(1.029) | 3.010(0.759) | 0.229 | 0.140 | 1.939(0.460) | 1.210(1.233) | 3.220(0.773) | 0.177 |
| | IPM | 0.086 | 0.971(0.339) | 0.330(0.604) | 1.750(0.657) | 3.269 | 0.086 | 1.118(0.405) | 0.700(0.835) | 1.840(0.762) | 3.355 |
| | ADMM | 0.086 | 0.952(0.363) | 0.910(1.173) | 1.600(0.696) | 3.178 | 0.098 | 1.249(0.365) | 1.620(1.523) | 1.980(0.738) | 3.230 |
| | PPA | 0.110 | 1.128(0.336) | 0.110(0.314) | 2.070(0.655) | 0.202 | 0.110 | 1.283(0.392) | 0.460(0.784) | 2.270(0.777) | 0.150 |
| | IPM | 0.134 | 3.087(0.643) | 3.890(2.331) | 4.510(0.893) | 2.683 | 0.125 | 3.371(0.602) | 4.780(2.729) | 4.910(0.911) | 2.739 |
| | ADMM | 0.137 | 2.897(0.496) | 7.840(3.589) | 4.250(0.903) | 3.432 | 0.134 | 3.197(0.477) | 8.640(3.586) | 4.600(0.964) | 3.491 |
| | PPA | 0.158 | 3.161(0.681) | 3.910(2.708) | 4.680(0.898) | 0.146 | 0.149 | 3.507(0.625) | 4.710(2.467) | 5.120(0.868) | 0.117 |
| Laplace | IPM | 0.086 | 1.066(0.409) | 0.380(0.708) | 1.910(0.753) | 3.352 | 0.086 | 1.372(0.493) | 1.130(1.284) | 2.350(0.903) | 3.417 |
| | ADMM | 0.098 | 1.177(0.441) | 1.350(1.591) | 2.010(0.745) | 3.248 | 0.104 | 1.540(0.494) | 2.510(2.267) | 2.510(0.904) | 3.223 |
| | PPA | 0.110 | 1.254(0.427) | 0.220(0.561) | 2.350(0.783) | 0.192 | 0.128 | 1.558(0.496) | 0.710(0.977) | 2.800(0.829) | 0.157 |
| | IPM | 0.101 | 1.795(0.435) | 1.300(1.314) | 2.940(0.789) | 2.923 | 0.104 | 2.160(0.517) | 2.280(1.735) | 3.230(0.827) | 2.980 |
| | ADMM | 0.128 | 1.889(0.409) | 3.320(2.344) | 2.920(0.813) | 3.215 | 0.110 | 2.210(0.462) | 5.180(3.439) | 3.250(0.833) | 3.345 |
| | PPA | 0.146 | 1.923(0.454) | 1.150(1.507) | 3.200(0.816) | 0.166 | 0.152 | 2.261(0.547) | 1.580(1.505) | 3.570(0.807) | 0.137 |
| Cauchy | IPM | 0.095 | 1.986(0.618) | 1.560(1.486) | 3.230(0.874) | 3.267 | 0.113 | 2.498(0.734) | 2.390(1.933) | 3.850(1.019) | 3.122 |
| | ADMM | 0.128 | 2.181(0.564) | 4.210(2.552) | 3.440(0.903) | 2.870 | 0.116 | 2.417(0.587) | 5.240(3.108) | 3.630(1.012) | 2.881 |
| | PPA | 0.158 | 2.357(0.700) | 1.460(1.374) | 3.800(0.888) | 0.212 | 0.134 | 2.667(0.805) | 2.650(2.167) | 4.160(1.080) | 0.178 |

| Method | $\tau$ | All data: genes | All data: Time(s) | Random partition: Ave.genes | Random partition: Pre_error | Random partition: Time(s) |
|---|---|---|---|---|---|---|
| ADMM | 0.25 | 17 | 3.843 | 17.200(1.807) | 0.050(0.009) | 4.686(0.804) |
| | 0.5 | 27 | 4.141 | 20.960(4.323) | 0.029(0.005) | 3.555(0.496) |
| | 0.75 | 19 | 4.314 | 21.280(2.611) | 0.040(0.005) | 3.534(0.405) |
| PPA | 0.25 | 20 | 0.208 | 16.440(3.721) | 0.023(0.006) | 0.235(0.056) |
| | 0.5 | 27 | 0.226 | 20.740(4.237) | 0.029(0.005) | 0.247(0.136) |
| | 0.75 | 17 | 0.181 | 12.500(3.032) | 0.024(0.004) | 0.352(0.068) |

Dongdong Zhang (School of Mathematics, SCUT, Guangzhou, China), Shaohua Pan (School of Mathematics, South China University of Technology, China) and Shujun Bi (School of Mathematics, South China University of Technology, China).
Abstract
This paper is concerned with the computation of the high-dimensional zero-norm penalized quantile regression estimator, defined as a global minimizer of the zero-norm penalized check loss function. To seek a desirable approximation to the estimator, we reformulate this NP-hard problem as an equivalent augmented Lipschitz optimization problem, and exploit its coupled structure to propose a multi-stage convex relaxation approach (MSCRA_PPA), each step of which solves inexactly a weighted $\ell_1$-regularized check loss minimization problem with a proximal dual semismooth Newton method. Under a restricted strong convexity condition, we provide the theoretical guarantee for the MSCRA_PPA by establishing the error bound of each iterate to the true estimator and the rate of linear convergence in a statistical sense. Numerical comparisons on some synthetic and real data show that MSCRA_PPA not only has comparable or even better estimation performance, but also requires much less CPU time.
Keywords: High-dimensional; Zero-norm penalized quantile regression; Variable selection; Proximal dual semismooth Newton method
1 Introduction
Sparse penalized regression has become a popular approach for high-dimensional data analysis. In the past two decades, many classes of sparse penalized regressions have been developed by imposing a suitable penalty term on the least squares loss, such as the bridge penalty in [14], Lasso in [37], SCAD in [10], elastic net in [45], adaptive Lasso in [46], and so on. We refer to the survey papers [3] and [11] for further references. These penalties, as a convex surrogate (say, the $\ell_1$-norm) or a nonconvex approximation (say, the bridge penalty) to the zero-norm, essentially try to capture the performance of the zero-norm, first used in the best subset selection by [6]. The sparse least squares regression approach is useful, but it only focuses on the central tendency of the conditional distribution. It is known that a certain covariate may not have a significant influence on the mean value of the response but may have a strong effect on the upper quantile of the conditional distribution due to the heterogeneity of data. It is likely that a covariate has different effects at different segments of the conditional distribution. As illustrated by [19], for non-Gaussian error distributions, the least squares regression is substantially outperformed by the quantile regression (QR).
Inspired by this, many researchers have recently considered the QR introduced by [19] for high-dimensional data analysis, owing to its robustness to outliers and its ability to offer unique insights into the relation between the response variable and the covariates; see, e.g., [39, 1, 40, 41, 12, 13]. [1] focused on the theory of the $\ell_1$-penalized QR, showed that this estimator is consistent at the near-oracle rate, and provided the conditions under which the selected model includes the true model; [41] studied the $\ell_1$-penalized least absolute deviation (LAD) regression and verified that the estimator has near-oracle performance with high probability; and [12] studied the weighted $\ell_1$-penalized QR and established the model selection oracle property and the asymptotic normality for this estimator. For nonconvex penalty-type QRs, [39] achieved, under mild conditions, the asymptotic oracle property of the SCAD and adaptive-Lasso penalized QRs, and [40] showed that with probability approaching one, the oracle estimator is a local optimal solution to the SCAD or MCP penalized QRs of ultra-high dimensionality. We notice that the above results are all established in the asymptotic regime.
Besides the above theoretical works, there are some works concerned with the computation of (weighted) $\ell_1$-penalized QR estimators which, compared to the (weighted) $\ell_1$-least-squares estimator, requires more sophisticated algorithms due to the piecewise linearity of the check loss function. Although the $\ell_1$-penalized QR model can be transformed into a linear program (LP) by introducing additional variables, and one may use interior point method (IPM) software such as SeDuMi in [34] to solve it, this is limited to the small or medium scale case; see Figures 1-2 in Section 5. Inspired by this, [38] proposed a greedy coordinate descent algorithm for the $\ell_1$-penalized LAD regression, [42] proposed a semismooth Newton coordinate descent algorithm for the elastic-net penalized QR, and [18] recently developed a semi-proximal alternating direction method of multipliers (sPADMM) and a combined version of ADMM and coordinate descent method (which is actually an inexact ADMM) for solving the weighted $\ell_1$-penalized QR. In addition, for nonconvex penalized QRs, [27] developed an iterative coordinate descent algorithm and established the convergence of any subsequence to a stationary point, and [13] provided a systematic study of folded concave penalized regressions, including the SCAD and MCP penalized QRs as special cases, and showed that with high probability the oracle estimator can be obtained within two iterations of the local linear approximation (LLA) approach proposed by [47]. We find that [27] and [13] did not establish the error bound of the iterates to the true solution.
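The LP reformulation mentioned above can be written out explicitly. The block below is the standard textbook construction, not a formula quoted from this paper: the splitting variables $u,v$ (positive and negative parts of the residual) and $\beta^{+},\beta^{-}$ (positive and negative parts of the coefficients), the weights $w_j\ge 0$, the parameter $\lambda$, the symbols $y$ and $X$ for the response and design matrix, and the $1/n$ scaling of the check loss are all notational assumptions.

```latex
\begin{aligned}
\min_{\beta^{+},\beta^{-}\in\mathbb{R}^{p},\;u,v\in\mathbb{R}^{n}}\quad
  & \frac{1}{n}\sum_{i=1}^{n}\big(\tau u_i+(1-\tau)v_i\big)
    +\lambda\sum_{j=1}^{p}w_j\big(\beta^{+}_j+\beta^{-}_j\big)\\
\text{s.t.}\quad
  & X\beta^{+}-X\beta^{-}+u-v=y,\\
  & \beta^{+}\ge 0,\quad \beta^{-}\ge 0,\quad u\ge 0,\quad v\ge 0.
\end{aligned}
```

At an optimal solution $u$ and $v$ are the positive and negative parts of $y-X(\beta^{+}-\beta^{-})$, so the first sum equals the check loss of the residual. The program has $2p+2n$ nonnegative variables and $n$ equality constraints, which gives some intuition for why general-purpose IPM solvers become expensive as $n$ and $p$ grow.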
This work concerns the computation of the high-dimensional zero-norm penalized QR estimator, a global minimizer of the zero-norm regularized check loss. To seek a high-quality approximation to this estimator, we reformulate this NP-hard problem as a mathematical program with an equilibrium constraint (MPEC), and obtain an equivalent augmented Lipschitz optimization problem from the global exact penalty of the MPEC. This augmented problem not only has a favorable coupled structure but also implies an equivalent DC (difference of convex) surrogate for the zero-norm regularized check loss minimization; see Section 2. By solving the augmented Lipschitz problem in an alternating way, we propose in Section 3 an MSCRA to compute a desirable surrogate for the zero-norm penalized QR estimator. Similar to the LLA method due to [47], the MSCRA solves in each step a weighted $\ell_1$-regularized check loss minimization, but the subproblems are allowed to be solved inexactly. Under a mild restricted strong convexity condition, we provide its theoretical guarantee in Section 4 by establishing the error bound of each iterate to the true estimator and the rate of linear convergence in a statistical sense.
Motivated by the recent work [35], we also develop a proximal dual semismooth Newton method (PDSN) in Section 5 for solving the subproblems involved in the MSCRA. Different from the semismooth Newton method by [42], this is a proximal point algorithm (PPA) with the subproblems solved by applying the semismooth Newton method to their duals, rather than to a smooth approximation to the elastic-net penalized check loss minimization problem. Numerical comparisons are made on some synthetic and real data for MSCRA_PPA, MSCRA_IPM and MSCRA_ADMM, which are the MSCRA with the subproblems solved by PDSN, SeDuMi in [34] and semi-proximal ADMM in [18], respectively. We find that MSCRA_IPM and MSCRA_ADMM have very similar performance, while MSCRA_PPA not only has estimation performance comparable to that of the two methods but also requires only one-fifteenth of the CPU time required by MSCRA_ADMM and MSCRA_IPM.
Throughout this paper, and denote an identity matrix and a vector of all ones, whose dimensions are known from the context. For an , write and , and denote by and the -norm, -norm and -norm of , respectively. For a matrix , and respectively denote the spectral norm, element-wise maximum norm, and maximum column sum norm of . For a set , means the characteristic function on , i.e., if , otherwise . For given with for , means the box set. For an extended real-valued function $f$, write , and denote by $\mathcal{P}_{\gamma}f$ and $e_{\gamma}f$ for a given $\gamma>0$ the proximal mapping and Moreau envelope of $f$, defined as $\mathcal{P}_{\gamma}f(x):=\mathop{\arg\min}_{z\in\mathbb{R}^{p}}\big\{f(z)+\frac{1}{2\gamma}\|z-x\|^{2}\big\}$ and $e_{\gamma}f(x):=\min_{z\in\mathbb{R}^{p}}\big\{f(z)+\frac{1}{2\gamma}\|z-x\|^{2}\big\}$. In the sequel, we write for . When $f$ is convex, $\mathcal{P}_{\gamma}f$ is a Lipschitz mapping with modulus $1$, and $e_{\gamma}f$ is a smooth convex function with $\nabla e_{\gamma}f(x)=\gamma^{-1}\big(x-\mathcal{P}_{\gamma}f(x)\big)$.
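To make the proximal mapping and Moreau envelope concrete, here is a minimal sketch for the scalar function $f(z)=|z|$, chosen only because its proximal mapping has a well-known closed form (soft-thresholding). The function names are illustrative and do not come from the paper. The loop compares the gradient identity $(x-\mathcal{P}_{\gamma}f(x))/\gamma$ with a central finite difference of the envelope.

```python
def prox_abs(x, gamma):
    """Proximal mapping of f(z) = |z|: argmin_z |z| + (z - x)^2 / (2*gamma)."""
    if x > gamma:
        return x - gamma
    if x < -gamma:
        return x + gamma
    return 0.0


def moreau_abs(x, gamma):
    """Moreau envelope of f(z) = |z|, evaluated through the proximal point."""
    z = prox_abs(x, gamma)
    return abs(z) + (z - x) ** 2 / (2.0 * gamma)


def grad_moreau_abs(x, gamma):
    """Gradient of the envelope via the identity (x - prox(x)) / gamma."""
    return (x - prox_abs(x, gamma)) / gamma


gamma, step = 0.5, 1e-6
for x in (-2.0, -0.3, 0.0, 0.2, 1.5):
    # Central finite difference of the envelope, for comparison with the identity.
    finite_diff = (moreau_abs(x + step, gamma) - moreau_abs(x - step, gamma)) / (2 * step)
    print(x, prox_abs(x, gamma), grad_moreau_abs(x, gamma), round(finite_diff, 6))
```

For this particular $f$ the envelope is the Huber function, which is differentiable even though $|z|$ is not; that smoothing effect is the reason envelopes are convenient inside Newton-type methods.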
2 Zero-norm penalized quantile regression and equivalent difference of convex model
Quantile regression is a popular method for studying the influence of a set of covariates on the conditional distribution of a response variable, and has been widely used to handle heteroscedasticity; see [20] and [40]. For a univariate response ${\bf Y}$ and a vector of covariates , the conditional cumulative distribution function of ${\bf Y}$ is denoted by $F_{\bf Y}(\cdot|x)$, and the $\tau$th conditional quantile of ${\bf Y}$ is given by $Q_{\bf Y}(\tau|x):=\inf\big\{t\!:F_{\bf Y}(t|x)\geq\tau\big\}$. Let $X$ be an $n\times p$ design matrix on . Consider the linear quantile regression
[TABLE]
where is the response vector, is the noise vector whose components are independently distributed and satisfy for some known constant , and is the true but unknown coefficient vector. This quantile regression model actually assumes that for . We are interested in the high-dimensional case where and the sparse model in the sense that only components of the unknown true are nonzero.
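The conditional quantile defined above is a generalized inverse of a distribution function. As a small illustration, the sketch below evaluates $\inf\{t: F_n(t)\ge\tau\}$ for the empirical distribution function $F_n$ of a finite sample; the function name and the sample are made up for the example, and no covariates are involved.

```python
def empirical_quantile(sample, tau):
    """Return inf{t : F_n(t) >= tau} for the empirical CDF F_n of `sample`."""
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie in (0, 1)")
    ordered = sorted(sample)
    n = len(ordered)
    # At the k-th order statistic the empirical CDF is at least k/n,
    # so the first k with k/n >= tau gives the infimum.
    for k, t in enumerate(ordered, start=1):
        if k / n >= tau:
            return t
    return ordered[-1]


data = [3, 1, 4, 1, 5, 9, 2, 6]
print([empirical_quantile(data, tau) for tau in (0.25, 0.5, 0.9)])
```

Because the infimum is taken over a right-continuous step function, the result is always one of the observed values, with no interpolation between neighbours.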
For , let be the check loss function of (1), i.e.,
[TABLE]
which was first introduced by [19]. To estimate the unknown true in (1), we consider the zero-norm regularized problem
[TABLE]
where is the regularization parameter, and denotes the zero-norm of (i.e., the number of nonzero entries of ). By the expression of , is nonnegative and coercive (i.e., whenever ). By Lemma 3 in Appendix A, the estimator is well defined. Since depends on , model (3) is able to monitor different “locations” of the conditional distribution, and the heteroscedasticity of the data, when it exists, can then be inspected by solving (3) with different . For simplicity, in the sequel we use to replace , and for a given , write and .
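As a concrete reference point for the check loss and the zero-norm, the sketch below uses the classical check (pinball) function, which equals $\tau u$ for $u\ge 0$ and $(\tau-1)u$ for $u<0$. The $1/n$ averaging and the helper names are assumptions made for the illustration rather than the paper's exact definitions. The loop shows that, for an intercept-only fit, minimizing the check loss picks out a sample quantile.

```python
def check(u, tau):
    """Check (pinball) function: tau*u for u >= 0 and (tau - 1)*u for u < 0."""
    return tau * u if u >= 0 else (tau - 1.0) * u


def check_loss(residuals, tau):
    """Average check loss of a residual vector (the 1/n scaling is assumed)."""
    return sum(check(r, tau) for r in residuals) / len(residuals)


def zero_norm(beta):
    """Number of nonzero entries of beta."""
    return sum(1 for b in beta if b != 0)


y = [1.0, 2.0, 3.0, 4.0, 100.0]
for tau in (0.5, 0.7):
    # For an intercept-only model the loss is piecewise linear in b,
    # so a minimizer can be found among the observations themselves.
    best = min(y, key=lambda b: check_loss([yi - b for yi in y], tau))
    print(tau, best)

print(zero_norm([0.0, 1.5, 0.0, -2.0]))
```

The outlier 100.0 does not pull the fitted value away from the bulk of the data, which is the robustness property that motivates replacing the least squares loss with the check loss.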
Due to the combination of the zero-norm, the computation of is NP-hard. To design an algorithm in the next section for seeking a high-quality approximation to , we next derive an equivalent augmented Lipschitz optimization problem from a primal-dual viewpoint, and to demonstrate that such a mechanism provides a unified way to yield equivalent DC surrogates for the zero-norm regularized problem (3), we introduce a family of proper lsc convex functions on , denoted by , satisfying the conditions:
[TABLE]
With a , clearly, the zero-norm is the optimal value function of
[TABLE]
This characterization of zero-norm shows that model (3) is equivalent to
[TABLE]
in the following sense: if is globally optimal to (3), then is a globally optimal solution of problem (5), and conversely, if is a globally optimal solution of (5), then is globally optimal to (3). Problem (5) is a mathematical program with an equilibrium constraint (abbreviated as MPEC). The equivalence between (3) and (5) shows that the difficulty of model (3) arises from the hidden equilibrium constraint. It is well known that the handling of nonconvex constraints is much harder than that of nonconvex objective functions. It is therefore natural to consider the penalized version of problem (5)
[TABLE]
where is the penalty parameter. Since is Lipschitz continuous, the following conclusion holds by Section 3.2 of [23].
Theorem 2.1
The problem (6) associated to each has the same global optimal solution set as the MPEC (5) does, where is the minimum element in such that .
Theorem 2.1 states that problem (6) is a global exact penalty of (5) in the sense that there is a threshold such that the former associated to every has the same global optimal solution set as the latter does. Together with the equivalence between (3) and (5), model (3) is equivalent to problem (6). Notice that the objective function of (6) is globally Lipschitz continuous over its feasible set and its nonconvexity is due to the coupled term rather than the combination. So, problem (6) provides an equivalent augmented Lipschitz reformulation for the zero-norm problem (3). In fact, problem (6) associated to every implies an equivalent DC surrogate for (3). To illustrate this, let if and otherwise . Then, with the conjugate of , one may check that (6) is equivalent to
[TABLE]
Since is a nondecreasing finite convex function on , the function is convex, and problem (7) is a DC program. To sum up the above discussions, problem (7) associated to every provides an equivalent DC surrogate for (3). Moreover, with for is a DC surrogate for the zero-norm. To close this section, we present some examples of .
Example 2.1
Let for . After a simple computation, we have
[TABLE]
It is immediate to see that the function reduces to the capped $\ell_1$-function in [44] with and .
Example 2.2
Let for . One can calculate
[TABLE]
It is not hard to check that reduces to the SCAD function in [10] when and .
Example 2.3
Let for . We have
[TABLE]
The will reduce to the MCP in [43] if .
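For orientation, the three classical penalties that the examples above reduce to can be evaluated as follows. The formulas are the usual textbook forms of the capped $\ell_1$, SCAD and MCP penalties, with parameter names `lam`, `cap`, `a` and `gamma` chosen for this sketch; they are not the parametrization used in Examples 2.1-2.3, whose displayed formulas are not reproduced here.

```python
def capped_l1(t, lam, cap):
    """Capped l1 penalty: grows like lam*|t| and is constant once |t| >= cap."""
    return lam * min(abs(t), cap)


def scad(t, lam, a=3.7):
    """SCAD penalty in its standard three-piece form (requires a > 2)."""
    s = abs(t)
    if s <= lam:
        return lam * s
    if s <= a * lam:
        return (2.0 * a * lam * s - s * s - lam * lam) / (2.0 * (a - 1.0))
    return lam * lam * (a + 1.0) / 2.0


def mcp(t, lam, gamma=3.0):
    """Minimax concave penalty in its standard form (requires gamma > 1)."""
    s = abs(t)
    if s <= gamma * lam:
        return lam * s - s * s / (2.0 * gamma)
    return gamma * lam * lam / 2.0


for t in (0.0, 0.5, 1.0, 2.0, 5.0, 50.0):
    print(t, capped_l1(t, 1.0, 2.0), scad(t, 1.0), mcp(t, 1.0))
```

All three agree with a multiple of $|t|$ near the origin and become constant for large $|t|$, which is how they mimic the zero-norm while avoiding the bias that the plain $\ell_1$-norm places on large coefficients.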
3 Multi-stage convex relaxation approach
From the last section, to compute the estimator , we only need to solve a single penalty problem (6) that is much easier than the zero-norm problem (3) because its nonconvexity only arises from the coupled term . Observe that (6) becomes a convex program when either of and is fixed. So, we solve it in an alternating way and propose the following multi-stage convex relaxation approach (MSCRA) with in Example 2.2.
Remark 3.1
(i) Step 1 of Algorithm 1 solves problem (6) with fixed to be , while Step 3 solves this problem with fixed to be ; that is, Algorithm 1 solves the nonconvex penalty problem (6) in an alternating way. In the first stage, since there is no information on estimating the nonzero entries of , it is reasonable to impose an unbiased weight on each component of . Motivated by this, we restrict the initial in , a subset of the feasible set of . When , the first stage is precisely the minimization of the $\ell_1$-penalized check loss function. Although the threshold is known when the parameter in (3) is given, we select a varying for (17) since it is just a relaxation of (6).
(ii) By the optimality condition of (17), for each , which by Theorem 23.5 in [31] and (11) is equivalent to saying
[TABLE]
Clearly, when is close to [math], in (18) may not equal though close to ; when is very large, in (18) may not equal [math] though close to [math]. To achieve a high-quality solution with Algorithm 1, the last term of (16) implies that a smaller but not [math] is expected for those larger , and a larger instead of is expected for those smaller . Thus, the function in Example 2.2 is desirable especially for those problems whose solutions have small nonzero entries. The weight associated to the function in Example 2.3 behaves similarly, but the weight associated to the function in Example 2.1 is different since if , if , otherwise .
(iii) Algorithm 1 is actually an inexact majorization-minimization (MM) method (see [22]) for solving the equivalent DC surrogate (7) with a special starting point. Indeed, for a given , the convexity and smoothness of imply that with for ,
[TABLE]
Notice that each by the expression of . Hence, the function
[TABLE]
is a majorization of at and the subproblem (16) is the inexact minimization of this majorization function. Also, for any given , when , we have by (11). Thus, the first stage of Algorithm 1 with is precisely the inexact MM method for (7) with satisfying . In addition, Algorithm 1 can be regarded as an inexact version of the LLA method proposed by [47] for (7), but it is different from the DC algorithm by [39] since the latter depends on the majorization of at and the obtained approximation lacks symmetry.
(iv)* Considering that practical computation always involves deviation, we allow the problem in (16) to be solved inexactly with the accuracy measured in the following way: and with such that*
[TABLE]
where the equality is by Theorem 23.8 in [31]. Notice that the first-order optimality conditions of (6) take the following form
[TABLE]
where is the Lagrange multiplier associated to . By Step 2 of Algorithm 1, . In view of this, we measure the KKT residual of (6) associated to at by
[TABLE]
where and with
[TABLE]
4 Theoretical guarantees of Algorithm 1
We denote by the support of the true vector , and define the set
[TABLE]
The matrix is said to have the -restricted strong convexity on if
[TABLE]
The RSC is equivalent to the restricted eigenvalue condition of the Gram matrix due to [16] and [4]. Notice that $\mathcal{C}(S^{*})\supseteq\big\{\beta\in\mathbb{R}^{p}:\|\beta_{(S^{*})^{c}}\|_{1}\leq 3\|\beta_{S^{*}}\|_{1}\big\}$. This RSC is a little stronger than the one used by [26] for the -regularized smooth loss minimization. In this section, we shall provide the deterministic theoretical guarantees for Algorithm 1 under this RSC, including the error bound of the iterate to the true and the decrease analysis of the error sequence. The proofs are all included in Appendix B. We need the following assumption on the optimality tolerance of :
Assumption 4.1
There exists such that for each , .
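The cone-type set $\big\{\beta:\|\beta_{(S^{*})^{c}}\|_{1}\leq 3\|\beta_{S^{*}}\|_{1}\big\}$ noted above as a subset of $\mathcal{C}(S^{*})$ can be checked numerically for a given vector. The following is a minimal sketch only; the function name `in_restricted_cone` and the 0-based index convention for the support are illustrative assumptions and do not come from the paper.

```python
def in_restricted_cone(beta, support, ratio=3.0):
    """Check ||beta_{S^c}||_1 <= ratio * ||beta_S||_1 for an index set S.

    `support` holds 0-based indices (an assumption of this sketch).
    """
    support = set(support)
    # l1 mass on the support S and on its complement S^c
    on_support = sum(abs(b) for i, b in enumerate(beta) if i in support)
    off_support = sum(abs(b) for i, b in enumerate(beta) if i not in support)
    return off_support <= ratio * on_support


# Mass mostly on S = {0, 1}: the vector lies in the cone.
inside = in_restricted_cone([1.0, -0.5, 0.2, 0.1], {0, 1})
# Mass mostly off S: the vector lies outside the cone.
outside = in_restricted_cone([0.1, 0.0, 0.5, 0.0], {0, 1})
print(inside, outside)
```

Since this set is only a subset of $\mathcal{C}(S^{*})$, a `False` result does not by itself show that a vector lies outside $\mathcal{C}(S^{*})$.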
First, by Lemma 7.4 in Appendix B, we have the following error bound.
Theorem 4.1
Suppose that Assumption 4.1 holds, that has the -RSC over , and that the noise vector is nonzero. If and are chosen such that and $\lambda\in\Big[\frac{16\overline{\tau}\|X\|_{1}}{n}+8\epsilon,\ \frac{\underline{\tau}^{2}\kappa-c^{-1}-3\overline{\tau}\|X\|_{\rm max}(2n^{-1}\overline{\tau}\|X\|_{1}+\epsilon)s^{*}}{3\overline{\tau}\|X\|_{\rm max}s^{*}}\Big]$ for some constant
[TABLE]
then for every
[TABLE]
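The admissible interval for $\lambda$ in Theorem 4.1 is plain arithmetic once its ingredients are known, so it can be evaluated to see whether the interval is nonempty for given data. The sketch below is illustrative only: the function and argument names are hypothetical, and the quantities $\overline{\tau}$, $\underline{\tau}$, $\|X\|_{1}$, $\|X\|_{\rm max}$, $\kappa$, $c$ and $\epsilon$ must be supplied by the caller according to their definitions in the paper, which this sketch does not restate.

```python
def lambda_interval(n, s_star, kappa, c, eps,
                    tau_bar, tau_under, x_norm_1, x_norm_max):
    """Evaluate the two endpoints of the lambda interval in Theorem 4.1.

    All arguments are caller-supplied values of the symbols in the theorem;
    the argument names themselves are assumptions of this sketch.
    """
    # Lower endpoint: 16 * tau_bar * ||X||_1 / n + 8 * eps
    lower = 16.0 * tau_bar * x_norm_1 / n + 8.0 * eps
    # Term subtracted in the numerator of the upper endpoint
    slack = (3.0 * tau_bar * x_norm_max
             * (2.0 * tau_bar * x_norm_1 / n + eps) * s_star)
    upper = ((tau_under ** 2 * kappa - 1.0 / c - slack)
             / (3.0 * tau_bar * x_norm_max * s_star))
    return lower, upper


lower, upper = lambda_interval(n=100, s_star=1, kappa=10.0, c=1.0, eps=0.0,
                               tau_bar=0.5, tau_under=0.5,
                               x_norm_1=1.0, x_norm_max=1.0)
# The interval is usable only when lower <= upper.
print(lower, upper, lower <= upper)
```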
Remark 4.1
(i)* For the -regularized least squares smooth loss estimator*
[TABLE]
the error bound was obtained in Corollary 2 of [26] by taking , where represents the variance of the noise. Compared with this error bound, the error bound in Theorem 4.1 involves the infinity norm of the noise rather than its variance; moreover, it still has the same order when the parameter in our model is rescaled to be .
(ii)* For the following -regularized square-root nonsmooth loss estimator*
[TABLE]
the error bound $\|\beta^{\rm sr}-\beta^{*}\|=O\big(\frac{\sigma\sqrt{s^{*}}\lambda^{\prime}\varpi}{n}\big)$ with was achieved in Theorem 1 of [2] by setting . By considering that , the parameter in our model corresponds to . Thus, the error bound in Theorem 4.1 corresponds to , which has the same order as $O\big(\frac{\sigma\sqrt{s^{*}}\lambda^{\prime}\varpi}{n}\big)$ since .
(iii)* To ensure that the constant exists, the constant needs to satisfy and the inexact accuracy of needs to satisfy*
[TABLE]
Since , it is necessary to solve the subproblem (16) with a very small inexact accuracy .
Theorem 4.1 establishes an error bound for every iterate , but it does not tell us if the error bound of the current is better than that of the previous . In order to seek the answer, we study the decrease of the error bound sequence by bounding . For this purpose, write and , and for each define
[TABLE]
From Lemma 7.6 in Appendix B, the value is upper bounded by
[TABLE]
By this, we have the following conclusion.
Theorem 4.2
Suppose that Assumption 4.1 holds, that has the -RSC over , and that the noise is nonzero. If is chosen as in Theorem 4.1 and the parameter satisfies , then for each
[TABLE]
where we stipulate that for .
Remark 4.2
(i)* The error bound in (4.2) consists of the statistical error due to the noise, the identification error related to the choice of and , and the computation errors and . By the definition of , when and are such that , the identification error becomes zero. If is not too small, it would be easy to choose such . Clearly, when and are chosen to be larger, the identification error is smaller. However, when and are larger, becomes larger and each component of is close to by (18). Consequently, it will become very conservative to cut those smaller entries of when solving the second subproblem. Hence, there is a trade-off between the choice of and and the computation speed of Algorithm 1.*
(ii)* If the subproblem (16) could be solved exactly, the computation error vanishes. If the subproblem (16) is solved with the accuracy satisfying for , this computation error will tend to [math] as . Since the third term on the right hand side of (4.2) is the combination of the noise and , we strongly recommend solving the subproblem (16) as accurately as possible.*
For the RSC assumption in Theorems 4.1 and 4.2, from [30] we know that if is from the -Gaussian ensemble (i.e., is formed by independently sampling each row ), there exists a constant (depending on ) such that the RSC holds on with probability greater than as long as , where and are absolute positive constants. From [5], for some sub-Gaussian , the RSC holds on with a high probability when is over a threshold depending on the Gaussian width of .
5 Proximal dual semismooth Newton method
By Remark 3.1 (iv), the pivotal part of Algorithm 1 is the exact solution of
[TABLE]
where, for each , is the function defined in (22). In this section, we develop a proximal dual semismooth Newton method (PDSN) for (26), which is a proximal point algorithm (PPA) with the subproblems solved by applying the semismooth Newton method to their dual problems.
Remark 5.1
(i)* Since and are convex but nondifferentiable, we follow the same line as in [35] to introduce a key proximal term in addition to the common . As will be shown later, this provides an effective way to handle the nonsmooth .*
(ii)* The first-order optimality conditions for (26) have the following form where is the multiplier vector associated to . Hence, the KKT residual of problem (26) at can be measured by*
[TABLE]
So, we suggest as the stopping condition of Algorithm 2.
The efficiency of Algorithm 2 depends on the solution of its subproblem, which by introducing a variable is equivalently written as
[TABLE]
After an elementary calculation, the dual of (5) takes the following form
[TABLE]
Since is a smooth convex function, seeking an optimal solution of the last dual problem is equivalent to finding a root to the system
[TABLE]
Since and are strongly semismooth by Appendix A and the composition of strongly semismooth mappings is strongly semismooth by [9], the mapping is strongly semismooth. Inspired by this, we use the semismooth Newton method to seek a root of system (28), which by [28] is expected to have a superlinear or even quadratic convergence rate. By Proposition 2.3.3 and Theorem 2.6.6 of [8], the Clarke Jacobian of at is included in
[TABLE]
where (5) is due to Lemmas 7.1 and 7.2 in Appendix A, and and are
[TABLE]
For each and , the matrix is positive semidefinite, and positive definite when or the matrix has full row rank with . To ensure that each iterate of the semismooth Newton method is well defined, that is, that each element of the Clarke Jacobian is nonsingular, we add a small positive definite perturbation to . The detailed iterates of the semismooth Newton method are provided in Appendix C.
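To make the semismooth ingredients concrete, the sketch below gives the well-known closed-form proximal mappings of the scalar check loss $\theta_{\tau}(u)=u\,(\tau-\mathbb{I}\{u<0\})$ and of a weighted absolute value, together with one element of the Clarke generalized derivative of the former. That these are the two strongly semismooth mappings referred to above is an assumption of this sketch, since their symbols are not visible in this section; the function names and the parameter `gamma` (the proximal step) are hypothetical, and any scaling of the loss used in the paper is not reproduced.

```python
def prox_check(z, tau, gamma):
    """Proximal mapping of gamma * theta_tau at the scalar z."""
    if z > gamma * tau:                  # positive branch of the check loss
        return z - gamma * tau
    if z < -gamma * (1.0 - tau):         # negative branch of the check loss
        return z + gamma * (1.0 - tau)
    return 0.0                           # flat zone around the kink at 0


def prox_weighted_abs(z, w, gamma):
    """Proximal mapping of gamma * w * |.| (soft-thresholding) at z."""
    shrink = gamma * w
    if z > shrink:
        return z - shrink
    if z < -shrink:
        return z + shrink
    return 0.0


def clarke_element_prox_check(z, tau, gamma):
    """One element of the Clarke generalized derivative of prox_check at z.

    The mapping is piecewise linear with slope 1 outside the flat zone and
    slope 0 inside it; at the two breakpoints any value in [0, 1] is valid,
    and this sketch picks 0 there.
    """
    outside = z > gamma * tau or z < -gamma * (1.0 - tau)
    return 1.0 if outside else 0.0


for z in (2.0, 0.1, -2.0):
    print(z, prox_check(z, tau=0.25, gamma=1.0),
          clarke_element_prox_check(z, tau=0.25, gamma=1.0))
```

Because the derivative elements are 0 or 1, a generalized Jacobian built from them can be rank deficient, which is consistent with the small positive definite perturbation mentioned above.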
6 Numerical experiments
We shall test the performance of Algorithm 1 with the subproblems solved by PDSN, SeDuMi and sPADMM, respectively, on synthetic and real data, and call the three solvers MSCRA_PPA, MSCRA_IPM and MSCRA_ADMM, respectively. Among them, SeDuMi solves the equivalent LP of (16):
[TABLE]
and the iterates of sPADMM are described in Appendix C. All numerical results are computed on a laptop running 64-bit Windows with an Intel(R) Core(TM) i7-8565 CPU 1.8GHz and 8 GB RAM.
For SeDuMi, we adopt the default setting, and for sPADMM we choose the step-size and the initial , and adopt the stopping criterion in Appendix C with and . For PDSN, we choose and where is the relative KKT residual at the initial , and adopt the stopping criterion in Remark 5.1(ii) with for and the stopping rule for Algorithm 1 in Appendix C.
For MSCRA_IPM, MSCRA_ADMM and MSCRA_PPA, we use , and terminate them at when , or and , or and , where $N_{\rm nz}(\beta^{k}):=\sum_{i=1}^{p}\mathbb{I}\big\{|\beta_{i}^{k}|>10^{-6}\max(1,\|\beta^{k}\|_{\infty})\big\}$ denotes the number of nonzero entries of , and is the KKT residual at the th step defined in (21). We update by $\rho_{1}=\max\big(1,\frac{1}{3\|\beta^{1}\|_{\infty}}\big)$ and $\rho_{k}=\min\big(\frac{5}{4}\rho_{k-1},\frac{10^{8}}{\|\beta^{k}\|_{\infty}}\big)$ for . In addition, during the implementation of the three solvers, we run SeDuMi, sPADMM and PDSN to solve the th subproblem with the optimal solution of the th subproblem yielded by them as the starting point. When , we choose to be the starting point of MSCRA_IPM and MSCRA_ADMM, and use to run Algorithm 2.
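The nonzero count $N_{\rm nz}$ and the update rule for $\rho_{k}$ stated above translate directly into code. The following is a minimal sketch with hypothetical function names; it assumes that $\beta^{k}$ is not the zero vector, since the update formulas divide by $\|\beta^{k}\|_{\infty}$.

```python
def count_nonzeros(beta, tol=1e-6):
    """N_nz(beta): number of entries with |beta_i| > tol * max(1, ||beta||_inf)."""
    beta_inf = max(abs(b) for b in beta)
    threshold = tol * max(1.0, beta_inf)
    return sum(1 for b in beta if abs(b) > threshold)


def update_rho(k, beta_k, rho_prev=None):
    """rho_1 = max(1, 1/(3||beta^1||_inf)); rho_k = min(5/4 rho_{k-1}, 1e8/||beta^k||_inf)."""
    beta_inf = max(abs(b) for b in beta_k)
    if beta_inf == 0.0:
        # The stated formulas are undefined here; this guard is an assumption.
        raise ValueError("beta_k must be nonzero")
    if k == 1:
        return max(1.0, 1.0 / (3.0 * beta_inf))
    return min(1.25 * rho_prev, 1e8 / beta_inf)


beta = [2.0, 1e-7, 0.0, -0.5]
print(count_nonzeros(beta))           # entries above 1e-6 * max(1, 2.0)
rho = update_rho(1, [0.1, -0.2])      # first-stage value
rho = update_rho(2, [0.1, -0.2], rho)  # grows by the factor 5/4 unless capped
print(rho)
```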
6.1. Comparisons of three solvers for the subproblem
We make numerical comparisons among SeDuMi, sPADMM and PDSN by applying them to the problem (16) for , i.e., the -regularized check loss minimization problem. Inspired by the work of [18], we consider the simulation model for in [15] to generate data, where for with , , and is chosen such that the signal-to-noise ratio of the data is . We focus on the high-dimensional situation with and and . Figures 1 and 2 show the optimal values yielded by the three solvers and their CPU time (in seconds) on solving (16) with and the same sequence of values of . By the results in Section 4, we select the values of by
[TABLE]
for , where , and and respectively for and . Such is such that attains the value [math], where represents the final output of a solver.
Figure 1 shows that the three solvers yield comparable optimal values, and the optimal values given by PDSN are a little better than those given by SeDuMi and sPADMM. Figure 2 shows that PDSN requires much less CPU time than SeDuMi and sPADMM do, and for the CPU time of the former is on average about and times that of SeDuMi and sPADMM, respectively, but for , when , PDSN requires more CPU time since the Clarke Jacobians are close to singularity. This shows that if the parameter in the model is not too small (a common setting for sparsity), PDSN is superior to SeDuMi and sPADMM in terms of the optimal value and CPU time. We find that sPADMM always attains the maximum number of iterations for all test problems (it even attains the maximum number of iterations if ). Since is used here, its CPU time is less than that of SeDuMi.
6.2. Numerical performance of Algorithm 1
We first apply MSCRA_PPA to the example in Section 3.1 of [40], i.e., solve (6) with for , for which the scalar response is generated according to the heteroscedastic location-scale model , where is independent of the covariates. Table 1 reports its identification performance for and under different sample sizes, where Size, AE, and have the same meaning as in [40]. We see that, for , always equals [math]. So, the check loss with cannot identify , but the check loss with and can identify and the proportion of identifying increases as becomes large.
Next we use a synthetic example to show that MSCRA_PPA can solve efficiently a series of zero-norm regularized problems (3) with different but a fixed . We generate an i.i.d. standard normal random vector with entries of chosen randomly from for , and then obtain the response vector from model (1), where for with and , and the noise is from the Laplace distribution with density . Here, is a matrix of all ones. Figure 3 describes the average absolute -error and time when applying MSCRA_PPA to test problems for with and . We see that MSCRA_PPA yields better -errors for close to , and worse -errors for close to [math] or . So, for this class of noises, the check loss with close to is suitable. The MSCRA_PPA yields a desired solution for all test problems in seconds, and the CPU time for close to [math] or is about times that of close to . This means that it is an efficient solver for a series of zero-norm regularized problems in (3).
7 Conclusions
We have proposed a multi-stage convex relaxation approach, MSCRA_PPA, for computing a desirable approximation to the zero-norm penalized QR, which is defined as a global minimizer of an NP-hard problem. Under the common RSC condition and a mild restriction on the noises, we established the error bound of every iterate to the true estimator and the linear rate of convergence of the iterate sequence in a statistical sense. Numerical comparisons with MSCRA_IPM and MSCRA_ADMM show that MSCRA_PPA yields a comparable estimation performance within much less time.
Supplementary Materials
The online supplementary material consists of five parts. Appendix A includes some preliminary knowledge on generalized subdifferentials and Clarke Jacobian, and some lemmas used in Sections 2-5; Appendix B includes the proofs of Theorem 4.1 and Theorem 4.2; Appendix C introduces the semismooth Newton method and the semi-proximal ADMM in [17]; Appendix D includes performance comparisons of MSCRA_IPM, MSCRA_ADMM and MSCRA_PPA on some synthetic data and real data.
Acknowledgements
The authors would like to give their sincere thanks to two anonymous reviewers for their helpful comments. The authors would like to express their sincere thanks to Professor Kim-Chuan Toh from National University of Singapore for giving them some help on the implementation of Algorithm 2 when he visited SCUT. This work is supported by the National Natural Science Foundation of China under project No. 11971177.
- [1] A. Belloni and V. Chernozhukov, $\ell_{1}$-penalized quantile regression in high-dimensional sparse models, The Annals of Statistics, 39(2011): 82-130.
- [2] A. Belloni, V. Chernozhukov and L. Wang, Square-root lasso: pivotal recovery of sparse signals via conic programming, Biometrika, 4(2011): 791-806.
- [3] P. Bickel and B. Li, Regularization in Statistics, Sociedad de Estadística e Investigación Operativa Test, 15(2006): 271-344.
- [4] P. Bickel, Y. Ritov and A. Tsybakov, Simultaneous analysis of lasso and dantzig selector, The Annals of Statistics, 37(2009): 1705-1732.
- [5] A. Banerjee, S. Chen, F. Fazayeli and V. Sivakumar, Estimation with norm regularization, Advances in Neural Information Processing Systems, 2(2015): 1556-1564.
- [6] L. Breiman, Heuristics of instability and stabilization in model selection, The Annals of Statistics, 24(1996): 2350-2383.
- [7] A. P. Chiang, Homozygosity mapping with SNP arrays identifies TRIM32, an E3 ubiquitin ligase, as a Bardet-Biedl syndrome gene (BBS11), Proceedings of the National Academy of Sciences, 103(2006): 6287-6292.
- [8] F. H. Clarke, Nonsmooth Analysis and Optimization, Wiley, New York, 1983.
