Zeroth-Order Stochastic Alternating Direction Method of Multipliers for Nonconvex Nonsmooth Optimization

Feihu Huang; Shangqian Gao; Songcan Chen; Heng Huang

arXiv:1905.12729·math.OC·July 31, 2019

TL;DR

This paper introduces fast zeroth-order stochastic ADMM algorithms for nonconvex nonsmooth optimization, proving an $O(1/T)$ convergence rate and demonstrating effectiveness on machine learning tasks such as black-box classification and black-box adversarial attacks.

Contribution

It proposes novel zeroth-order stochastic ADMM methods for nonconvex problems with nonsmooth penalties, extending ADMM applicability to gradient-free scenarios.

Findings

01. Achieve an $O(1/T)$ convergence rate for nonconvex optimization.

02. Effectively solve complex machine learning problems with multiple penalties.

03. Validated through experiments on black-box classification and adversarial attacks.

Abstract

Alternating direction method of multipliers (ADMM) is a popular optimization tool for composite and constrained problems in machine learning. However, in many machine learning problems such as black-box attacks and bandit feedback, ADMM could fail because the explicit gradients of these problems are difficult or infeasible to obtain. Zeroth-order (gradient-free) methods can effectively solve these problems because they require only objective function values. A few zeroth-order ADMM methods exist, but they rely on convexity of the objective function, which limits them in many applications. In this paper, we therefore propose a class of fast zeroth-order stochastic ADMM methods (i.e., ZO-SVRG-ADMM and ZO-SAGA-ADMM) for solving nonconvex problems with multiple nonsmooth penalties, based on the coordinate…

Tables

Table 1: Comparison of the convergence properties of the zeroth-order ADMM algorithms and related methods. C, NC, S, NS and mNS abbreviate convex, non-convex, smooth, non-smooth and the sum of multiple non-smooth functions, respectively. $T$ is the total number of iterations. GauSGE, UniSGE and CooSGE denote the Gaussian Smoothing Gradient Estimator, the Uniform Smoothing Gradient Estimator and the Coordinate Smoothing Gradient Estimator, respectively.

Algorithm	Reference	Gradient Estimator	Problem	Convergence Rate
ZOO-ADMM	Liu et al. (2018a)	GauSGE	C(S) + C(NS)	$O (\sqrt{1 / T})$
ZO-GADM	Gao et al. (2018)	UniSGE	C(S) + C(NS)	$O (\sqrt{1 / T})$
RSPGF	Ghadimi et al. (2016)	GauSGE	NC(S) + C(NS)	$O (\sqrt{1 / T})$
ZO-ProxSVRG	Huang et al. (2019b)	CooSGE	NC(S) + C(NS)	$O (1 / T)$
ZO-ProxSAGA	Huang et al. (2019b)	CooSGE	NC(S) + C(NS)	$O (1 / T)$
ZO-SVRG-ADMM	Ours	CooSGE	NC(S) + C(mNS)	$O (1 / T)$
ZO-SAGA-ADMM	Ours	CooSGE	NC(S) + C(mNS)	$O (1 / T)$

Table 2: Real datasets for black-box binary classification

datasets	#samples	#features	#classes
20news	16,242	100	2
a9a	32,561	123	2
w8a	64,700	300	2
covtype.binary	581,012	54	2

Equations

\min_{x,\{y_{j}\}_{j=1}^{k}}

Ax+\sum_{j=1}^{k}B_{j}y_{j}=c,

\mathbb{E}\big[\mbox{dist}(0,\partial L(x^{*},y_{[k]}^{*},\lambda^{*}))^{2}\big]\leq\epsilon,

\partial L(x,y_{[k]},\lambda)=\begin{bmatrix}\nabla_{x}L(x,y_{[k]},\lambda)\\ \partial_{y_{1}}L(x,y_{[k]},\lambda)\\ \vdots\\ \partial_{y_{k}}L(x,y_{[k]},\lambda)\\ -Ax-\sum_{j=1}^{k}B_{j}y_{j}+c\end{bmatrix},

\|\nabla f_{i}(x)-\nabla f_{i}(y)\|\leq L\|x-y\|,\ \forall x,y\in\mathbb{R}^{d},

f_{i}(x)\leq f_{i}(y)+\nabla f_{i}(y)^{T}(x-y)+\frac{L}{2}\|x-y\|^{2}.

\mathcal{L}_{\rho}(x,y_{[k]},\lambda)

+\frac{\rho}{2}\|Ax+\sum_{j=1}^{k}B_{j}y_{j}-c\|^{2},

\hat{\nabla}f_{i}(x)=\sum_{j=1}^{d}\frac{1}{2\mu_{j}}\big(f_{i}(x+\mu_{j}e_{j})-f_{i}(x-\mu_{j}e_{j})\big)e_{j},

\left\{\begin{aligned} &y_{j}^{t+1}=\arg\min_{y_{j}}\Big\{\mathcal{L}_{\rho}(x_{t},y_{[j-1]}^{t+1},y_{j},y_{[j+1:k]}^{t},\lambda_{t})+\frac{1}{2}\|y_{j}-y_{j}^{t}\|^{2}_{H_{j}}\Big\},\ \forall j\in[k]\\ &x_{t+1}=\arg\min_{x}\hat{\mathcal{L}}_{\rho}\big(x,y_{[k]}^{t+1},\lambda_{t},\hat{\nabla}f(x)\big)\\ &\lambda_{t+1}=\lambda_{t}-\rho\Big(Ax_{t+1}+\sum_{j=1}^{k}B_{j}y_{j}^{t+1}-c\Big),\end{aligned}\right.

\hat{\mathcal{L}}_{\rho}\big(x,y_{[k]}^{t+1},\lambda_{t},\hat{\nabla}f(x)\big)=f(x_{t})+\hat{\nabla}f(x)^{T}(x-x_{t})+\frac{1}{2\eta}\|x-x_{t}\|_{G}^{2}+\sum_{j=1}^{k}\psi_{j}(y_{j}^{t+1})-\lambda_{t}^{T}\Big(Ax+\sum_{j=1}^{k}B_{j}y_{j}^{t+1}-c\Big)+\frac{\rho}{2}\Big\|Ax+\sum_{j=1}^{k}B_{j}y_{j}^{t+1}-c\Big\|^{2},

R_{t}^{s}=

\quad+\frac{18L^{2}d}{\sigma^{A}_{\min}\rho b}\|x^{s}_{t-1}-\tilde{x}^{s}\|^{2}+c_{t}\|x^{s}_{t}-\tilde{x}^{s}\|^{2}\big],

c_{t}=\begin{cases}\frac{36L^{2}d}{\sigma^{A}_{\min}\rho b}+\frac{2Ld}{b}+(1+\beta)c_{t+1}, & 1\leq t\leq m,\\ 0, & t\geq m+1.\end{cases}

\min_{s,t}\mathbb{E}\big[\mbox{dist}(0,\partial L(x^{s}_{t},y_{[k]}^{s,t},\lambda^{s}_{t}))^{2}\big]\leq O\Big(\frac{\tilde{\nu}d^{2l}}{T}\Big)+O(d^{2+2l}\mu^{2}),

\frac{1}{\mu}=O\Big(\frac{d^{1+l}}{\sqrt{\epsilon}}\Big),\quad T=O\Big(\frac{\tilde{\nu}d^{2l}}{\epsilon}\Big),

\Omega_{t}=\mathbb{E}\Big[\mathcal{L}_{\rho}(x_{t},y_{[k]}^{t},\lambda_{t})+\Big(\frac{3\sigma^{2}_{\max}(G)}{\sigma^{A}_{\min}\rho\eta^{2}}+\frac{9L^{2}}{\sigma^{A}_{\min}\rho}\Big)\|x_{t}-x_{t-1}\|^{2}+\frac{18L^{2}d}{\sigma^{A}_{\min}\rho b}\frac{1}{n}\sum_{i=1}^{n}\|x_{t-1}-z^{t-1}_{i}\|^{2}+c_{t}\frac{1}{n}\sum_{i=1}^{n}\|x_{t}-z^{t}_{i}\|^{2}\Big].

c_{t}=\begin{cases}\frac{36L^{2}d}{\sigma^{A}_{\min}\rho b}+\frac{2Ld}{b}+(1-\hat{p})(1+\beta)c_{t+1}, & 0\leq t\leq T-1,\\ 0, & t\geq T,\end{cases}

\min_{1\leq t\leq T}\mathbb{E}\big[\mbox{dist}(0,\partial L(x_{t},y_{[k]}^{t},\lambda_{t}))^{2}\big]\leq O\Big(\frac{\tilde{\nu}d^{2l}}{T}\Big)+O(d^{2+2l}\mu^{2}),

\frac{1}{\mu}=O\Big(\frac{d^{1+l}}{\sqrt{\epsilon}}\Big),\quad T=O\Big(\frac{\tilde{\nu}d^{2l}}{\epsilon}\Big),

\min_{x\in\mathbb{R}^{d}}\frac{1}{n}\sum_{i=1}^{n}f_{i}(x)+\tau_{1}\|x\|_{1}+\tau_{2}\|\hat{G}x\|_{1},

\min_{x\in\mathbb{R}^{d}}

+\tau_{1}\sum_{p=1}^{P}\sum_{q=1}^{Q}\|x_{\mathcal{G}_{p,q}}\|_{2}+\tau_{2}\|x\|_{2}^{2}+\tau_{3}h(x),

\mathbb{E}\|\lambda_{t+1}^{s}-\lambda_{t}^{s}\|^{2}\leq

+\Big(\frac{3\sigma_{\max}^{2}(G)}{\sigma_{\min}^{A}\eta^{2}}+\frac{9L^{2}}{\sigma_{\min}^{A}}\Big)\|x_{t}^{s}-x_{t-1}^{s}\|^{2}+\frac{9L^{2}d^{2}\mu^{2}}{\sigma_{\min}^{A}}.

\hat{g}_{t}^{s}+\frac{1}{\eta}G(x_{t+1}^{s}-x_{t}^{s})-A^{T}\lambda_{t}^{s}+\rho A^{T}\Big(Ax_{t+1}^{s}+\sum_{j=1}^{k}B_{j}y_{j}^{s,t+1}-c\Big)=0.

A^{T}\lambda_{t+1}^{s}=\hat{g}_{t}^{s}+\frac{1}{\eta}G(x_{t+1}^{s}-x_{t}^{s}).

\lambda^{s}_{t+1}=(A^{T})^{+}\Big(\hat{g}^{s}_{t}+\frac{1}{\eta}G(x^{s}_{t+1}-x^{s}_{t})\Big),

\mathbb{E}\|\lambda_{t+1}^{s}-\lambda_{t}^{s}\|^{2}

\leq\frac{1}{\sigma^{A}_{\min}}\Big[3\mathbb{E}\|\hat{g}^{s}_{t}-\hat{g}^{s}_{t-1}\|^{2}+\frac{3\sigma^{2}_{\max}(G)}{\eta^{2}}\mathbb{E}\|x^{s}_{t+1}-x^{s}_{t}\|^{2}+\frac{3\sigma^{2}_{\max}(G)}{\eta^{2}}\mathbb{E}\|x^{s}_{t}-x^{s}_{t-1}\|^{2}\Big],

\mathbb{E}\|\hat{g}_{t}^{s}-\hat{g}_{t-1}^{s}\|^{2}

\leq 3\mathbb{E}\|\hat{g}_{t}^{s}-\nabla f(x_{t}^{s})\|^{2}+3\mathbb{E}\|\nabla f(x_{t}^{s})-\nabla f(x_{t-1}^{s})\|^{2}+3\mathbb{E}\|\nabla f(x_{t-1}^{s})-\hat{g}_{t-1}^{s}\|^{2}

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Machine Learning and ELM

Methods: Alternating Direction Method of Multipliers

Full text

Zeroth-Order Stochastic Alternating Direction Method of Multipliers for Nonconvex Nonsmooth Optimization

Feihu Huang¹, Shangqian Gao¹, Songcan Chen²,³ and Heng Huang¹,⁴

1 Department of Electrical & Computer Engineering, University of Pittsburgh, USA

2 College of Computer Science & Technology, Nanjing University of Aeronautics and Astronautics

3 MIIT Key Laboratory of Pattern Analysis & Machine Intelligence, China

4 JD Finance America Corporation

[email protected], [email protected], [email protected], [email protected] Corresponding Author.

Abstract

Alternating direction method of multipliers (ADMM) is a popular optimization tool for composite and constrained problems in machine learning. However, in many machine learning problems such as black-box learning and bandit feedback, ADMM could fail because the explicit gradients of these problems are difficult or even infeasible to obtain. Zeroth-order (gradient-free) methods can effectively solve these problems because they require only objective function values. A few zeroth-order ADMM methods exist, but they rely on convexity of the objective function, which limits them in many applications. In this paper, we therefore propose a class of fast zeroth-order stochastic ADMM methods (i.e., ZO-SVRG-ADMM and ZO-SAGA-ADMM) for solving nonconvex problems with multiple nonsmooth penalties, based on the coordinate smoothing gradient estimator. Moreover, we prove that both ZO-SVRG-ADMM and ZO-SAGA-ADMM have a convergence rate of $O(1/T)$, where $T$ denotes the number of iterations. In particular, our methods not only reach the best known convergence rate of $O(1/T)$ for nonconvex optimization, but can also effectively solve many complex machine learning problems with multiple regularized penalties and constraints. Finally, we conduct experiments on black-box binary classification and structured adversarial attacks on black-box deep neural networks to validate the efficiency of our algorithms.

1 Introduction

Alternating direction method of multipliers (ADMM Gabay and Mercier (1976); Boyd et al. (2011)) is a popular optimization tool for solving the composite and constrained problems in machine learning. In particular, ADMM can efficiently optimize some problems with complicated structure regularization such as the graph-guided fused lasso Kim et al. (2009), which is too complicated for the other popular optimization methods such as proximal gradient methods Beck and Teboulle (2009). For the large-scale optimization, the stochastic ADMM method Ouyang et al. (2013) has been proposed. Recently, some faster stochastic ADMM methods Suzuki (2014); Zheng and Kwok (2016) have been proposed by using the variance reduced (VR) techniques such as the SVRG Johnson and Zhang (2013). In fact, ADMM is also highly successful in solving various nonconvex problems such as training deep neural networks Taylor et al. (2016). Thus, some fast nonconvex stochastic ADMM methods have been developed in Huang et al. (2016, 2019a).

Currently, most ADMM methods need to compute gradients of the objective function at each iteration. However, in many machine learning problems, an explicit expression for the gradient of the objective function is difficult or infeasible to obtain. For example, in black-box situations, only prediction results (i.e., function values) are provided Chen et al. (2017); Liu et al. (2018b). In bandit settings Agarwal et al. (2010), the player receives only partial feedback in terms of loss function values, so it is impossible to obtain an explicit gradient of the loss function. Clearly, classic optimization methods based on first-order gradients or second-order information are not suited to these problems. Recently, zeroth-order optimization methods Duchi et al. (2015); Nesterov and Spokoiny (2017) have been developed that use only function values in the optimization.

In this paper, we focus on using zeroth-order methods to solve the following nonconvex nonsmooth problem:

$$\min_{x,\{y_{j}\}_{j=1}^{k}}\ F(x,y_{[k]}):=f(x)+\sum_{j=1}^{k}\psi_{j}(y_{j})\quad\mbox{s.t.}\quad Ax+\sum_{j=1}^{k}B_{j}y_{j}=c,\qquad(1)$$

where $A\in\mathbb{R}^{p\times d}$, $B_{j}\in\mathbb{R}^{p\times q}$ for all $j\in[k],\ k\geq 1$, $f(x)=\frac{1}{n}\sum_{i=1}^{n}f_{i}(x):\mathbb{R}^{d}\rightarrow\mathbb{R}$ is a nonconvex and black-box function, and each $\psi_{j}(y_{j}):\mathbb{R}^{q}\rightarrow\mathbb{R}$ is a convex and nonsmooth function. In machine learning, the function $f(x)$ can be used for the empirical loss, $\sum_{j=1}^{k}\psi_{j}(y_{j})$ for multiple structure penalties (e.g., sparse + group sparse), and the constraint for encoding the structure pattern of the model parameters, such as a graph structure. Because it can flexibly split the objective function into the loss $f(x)$ and each penalty $\psi_{j}(y_{j})$, ADMM is an efficient method for solving the above problem. However, in problem (1) we can access only objective values rather than the whole explicit function $F(x,y_{[k]})$, so the classic ADMM methods are unsuitable for problem (1).

Recently, Gao et al. (2018); Liu et al. (2018a) proposed zeroth-order stochastic ADMM methods, which use only objective values to optimize. However, these zeroth-order ADMM-based methods rely on convexity of the objective function, so they are limited in many nonconvex problems such as adversarial attacks on black-box deep neural networks (DNNs). At the same time, because problem (1) includes multiple nonsmooth regularization functions and an equality constraint, the existing nonconvex zeroth-order algorithms Liu et al. (2018b); Huang et al. (2019b) are not suitable for solving it.

In this paper, we therefore propose a class of fast zeroth-order stochastic ADMM methods (i.e., ZO-SVRG-ADMM and ZO-SAGA-ADMM) to solve problem (1) based on the coordinate smoothing gradient estimator Liu et al. (2018b). In particular, the ZO-SVRG-ADMM and ZO-SAGA-ADMM methods build on SVRG Johnson and Zhang (2013) and SAGA Defazio et al. (2014), respectively. Moreover, we study the convergence properties of the proposed methods. Table 1 shows the convergence properties of the proposed methods and other related ones.

1.1 Challenges and Contributions

Although both SVRG and SAGA perform well in first-order and second-order methods, applying these techniques to the nonconvex zeroth-order ADMM method is not trivial. There exist at least two main challenges:

• Due to the failure of Fejér monotonicity of the iterates, the convergence analysis of nonconvex ADMM is generally quite difficult Wang et al. (2015). Using an inexact zeroth-order estimated gradient makes this difficulty greater.

• To guarantee convergence of our zeroth-order ADMM methods, we need to design a new effective Lyapunov function, which cannot simply follow the existing nonconvex (stochastic) ADMM methods Jiang et al. (2019); Huang et al. (2016).

Thus, we carefully establish the Lyapunov functions in the following theoretical analysis to ensure convergence of the proposed methods. In summary, our major contributions are given below:

1) We propose a class of fast zeroth-order stochastic ADMM methods (i.e., ZO-SVRG-ADMM and ZO-SAGA-ADMM) to solve problem (1).

2) We prove that both ZO-SVRG-ADMM and ZO-SAGA-ADMM have a convergence rate of $O(\frac{1}{T})$ for nonconvex nonsmooth optimization. In particular, our methods not only reach the best existing convergence rate $O(\frac{1}{T})$ for nonconvex optimization, but can also effectively solve many machine learning problems with multiple complex regularized penalties.

3) Extensive experiments on black-box classification and structured adversarial attacks on black-box DNNs validate the efficiency of the proposed algorithms.

2 Related Works

Zeroth-order (gradient-free) optimization is a powerful tool for solving many machine learning problems in which the gradient of the objective function is not available or is computationally prohibitive. Recently, zeroth-order optimization methods have been widely applied and studied. For example, they have been applied to bandit feedback analysis Agarwal et al. (2010) and black-box attacks on DNNs Chen et al. (2017); Liu et al. (2018b). Nesterov and Spokoiny (2017) have proposed several random zeroth-order methods based on the Gaussian smoothing gradient estimator. To deal with nonsmooth regularization, Gao et al. (2018); Liu et al. (2018a) have proposed zeroth-order online/stochastic ADMM-based methods.

So far, the above algorithms mainly rely on convexity of the problem. In fact, zeroth-order methods are also highly successful in solving various nonconvex problems such as adversarial attacks on black-box DNNs Liu et al. (2018b). Thus, Ghadimi and Lan (2013); Liu et al. (2018b); Gu et al. (2018) have begun to study zeroth-order stochastic methods for nonconvex optimization. To deal with nonsmooth regularization, Ghadimi et al. (2016); Huang et al. (2019b) have proposed some nonconvex zeroth-order proximal stochastic gradient methods. However, these methods are still not well suited to some complex machine learning problems, such as the structured adversarial attack on black-box DNNs described in the experiments below.

2.1 Notations

Let $y_{[k]}=\{y_{1},\cdots,y_{k}\}$ and $y_{[j:k]}=\{y_{j},\cdots,y_{k}\}$ for $j\in[k]$ . Given a positive definite matrix $G$ , $\|x\|^{2}_{G}=x^{T}Gx$ ; $\sigma_{\max}(G)$ and $\sigma_{\min}(G)$ denote the largest and smallest eigenvalues of $G$ , respectively, and $\kappa_{G}=\frac{\sigma_{\max}(G)}{\sigma_{\min}(G)}$ . $\sigma^{A}_{\max}$ and $\sigma^{A}_{\min}$ denote the largest and smallest eigenvalues of matrix $A^{T}A$ .

3 Preliminaries

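In this section, we begin by restating a standard $\epsilon$-approximate stationary point of problem (1), as in Jiang et al. (2019); Huang et al. (2019a).

Definition 1.

Given $\epsilon>0$, the point $(x^{*},y_{[k]}^{*},\lambda^{*})$ is said to be an $\epsilon$-approximate stationary point of problem (1) if it holds that

$$\mathbb{E}\big[\mbox{dist}(0,\partial L(x^{*},y_{[k]}^{*},\lambda^{*}))^{2}\big]\leq\epsilon,$$

where $L(x,y_{[k]},\lambda)=f(x)+\sum_{j=1}^{k}\psi_{j}(y_{j})-\langle\lambda,Ax+\sum_{j=1}^{k}B_{j}y_{j}-c\rangle$,

$$\partial L(x,y_{[k]},\lambda)=\begin{bmatrix}\nabla_{x}L(x,y_{[k]},\lambda)\\ \partial_{y_{1}}L(x,y_{[k]},\lambda)\\ \vdots\\ \partial_{y_{k}}L(x,y_{[k]},\lambda)\\ -Ax-\sum_{j=1}^{k}B_{j}y_{j}+c\end{bmatrix},$$

and $\mbox{dist}(0,\partial L)=\inf_{L^{\prime}\in\partial L}\|0-L^{\prime}\|$.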

Next, we make some mild assumptions regarding problem (1) as follows:

Assumption 1.

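Each function $f_{i}(x)$ is $L$-smooth for all $i\in\{1,2,\cdots,n\}$, such that

$$\|\nabla f_{i}(x)-\nabla f_{i}(y)\|\leq L\|x-y\|,\ \forall x,y\in\mathbb{R}^{d},$$

which implies

$$f_{i}(x)\leq f_{i}(y)+\nabla f_{i}(y)^{T}(x-y)+\frac{L}{2}\|x-y\|^{2}.$$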

Assumption 2.

Full gradient of loss function $f(x)$ is bounded, i.e., there exists a constant $\delta>0$ such that for all $x$ , it follows that $\|\nabla f(x)\|^{2}\leq\delta^{2}$ .

Assumption 3.

$f(x)$ and $\psi_{j}(y_{j})$ for all $j\in[k]$ are all lower bounded; denote $f^{*}=\inf_{x}f(x)$ and $\psi_{j}^{*}=\inf_{y}\psi_{j}(y)$ for $j\in[k]$.

Assumption 4.

$A$ is a full row or column rank matrix.

Assumption 1 has been commonly used in the convergence analysis of nonconvex algorithms Ghadimi et al. (2016). Assumption 2 is widely used for stochastic gradient-based and ADMM-type methods Boyd et al. (2011). Assumptions 3 and 4 are usually used in the convergence analysis of ADMM methods Jiang et al. (2019); Huang et al. (2016, 2019a). Without loss of generality, we will use the full column rank of matrix $A$ in the rest of this paper.

4 Fast Zeroth-Order Stochastic ADMMs

In this section, we propose a class of zeroth-order stochastic ADMM methods to solve the problem (1). First, we define an augmented Lagrangian function of the problem (1):

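$$\mathcal{L}_{\rho}(x,y_{[k]},\lambda)=f(x)+\sum_{j=1}^{k}\psi_{j}(y_{j})-\Big\langle\lambda,Ax+\sum_{j=1}^{k}B_{j}y_{j}-c\Big\rangle+\frac{\rho}{2}\Big\|Ax+\sum_{j=1}^{k}B_{j}y_{j}-c\Big\|^{2},$$

where $\lambda\in\mathbb{R}^{p}$ and $\rho>0$ denote the dual variable and penalty parameter, respectively.

In problem (1), the explicit expression of the function $f_{i}(x)$ is not available; only its function values are. To avoid computing explicit gradients, we use the coordinate smoothing gradient estimator Liu et al. (2018b) to estimate gradients: for $i\in[n]$,

$$\hat{\nabla}f_{i}(x)=\sum_{j=1}^{d}\frac{1}{2\mu_{j}}\big(f_{i}(x+\mu_{j}e_{j})-f_{i}(x-\mu_{j}e_{j})\big)e_{j},$$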

where $\mu_{j}$ is a coordinate-wise smoothing parameter, and $e_{j}$ is a standard basis vector with 1 at its $j$ -th coordinate, and 0 otherwise.
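As a minimal sketch of the estimator above (the function name `coord_grad_estimate` and the test function are illustrative assumptions, not taken from the paper), the central-difference formula can be written directly in NumPy. Each call costs $2d$ function evaluations and never touches an explicit gradient.

```python
import numpy as np

def coord_grad_estimate(f, x, mu):
    """Coordinate-wise central-difference estimate of the gradient of f at x.

    f  : callable mapping a 1-D array to a scalar (only function values are used)
    mu : scalar or length-d array of smoothing parameters mu_j > 0
    """
    x = np.asarray(x, dtype=float)
    d = x.size
    mu = np.broadcast_to(np.asarray(mu, dtype=float), (d,))
    grad = np.zeros(d)
    for j in range(d):
        e_j = np.zeros(d)
        e_j[j] = 1.0  # standard basis vector e_j
        grad[j] = (f(x + mu[j] * e_j) - f(x - mu[j] * e_j)) / (2.0 * mu[j])
    return grad

# Illustrative smooth test function (an assumption chosen for checking only).
f = lambda x: np.sum(x ** 2) + np.sin(x[0])
x0 = np.array([0.3, -1.2, 2.0])
g_hat = coord_grad_estimate(f, x0, mu=1e-4)
g_true = 2 * x0 + np.array([np.cos(x0[0]), 0.0, 0.0])
```

On smooth functions the estimate differs from the true gradient by a term that shrinks with the smoothing parameters, which is why the analysis later ties $\mu$ to the target accuracy.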

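Based on the above estimated gradients, we propose a zeroth-order ADMM (ZO-ADMM) method to solve problem (1) by executing the following iterations, for $t=1,2,\cdots$

$$\left\{\begin{aligned} &y_{j}^{t+1}=\arg\min_{y_{j}}\Big\{\mathcal{L}_{\rho}(x_{t},y_{[j-1]}^{t+1},y_{j},y_{[j+1:k]}^{t},\lambda_{t})+\frac{1}{2}\|y_{j}-y_{j}^{t}\|^{2}_{H_{j}}\Big\},\ \forall j\in[k]\\ &x_{t+1}=\arg\min_{x}\hat{\mathcal{L}}_{\rho}\big(x,y_{[k]}^{t+1},\lambda_{t},\hat{\nabla}f(x)\big)\\ &\lambda_{t+1}=\lambda_{t}-\rho\Big(Ax_{t+1}+\sum_{j=1}^{k}B_{j}y_{j}^{t+1}-c\Big),\end{aligned}\right.$$

where the term $\frac{1}{2}\|y_{j}-y_{j}^{t}\|^{2}_{H_{j}}$ with $H_{j}\succ 0$ is used to linearize the term $\|Ax+\sum_{j=1}^{k}B_{j}y_{j}-c\|^{2}$. Here, because an inexact zeroth-order gradient is used to update $x$, we define an approximate function over $x_{t}$ as follows:

$$\hat{\mathcal{L}}_{\rho}\big(x,y_{[k]}^{t+1},\lambda_{t},\hat{\nabla}f(x)\big)=f(x_{t})+\hat{\nabla}f(x)^{T}(x-x_{t})+\frac{1}{2\eta}\|x-x_{t}\|_{G}^{2}+\sum_{j=1}^{k}\psi_{j}(y_{j}^{t+1})-\lambda_{t}^{T}\Big(Ax+\sum_{j=1}^{k}B_{j}y_{j}^{t+1}-c\Big)+\frac{\rho}{2}\Big\|Ax+\sum_{j=1}^{k}B_{j}y_{j}^{t+1}-c\Big\|^{2},$$

where $G\succ 0$, $\hat{\nabla}f(x)$ is the zeroth-order gradient and $\eta>0$ is a step size. Since the matrix $A^{T}A$ can be large, we set $G=rI-\rho\eta A^{T}A\succ I$ with $r>\rho\eta\sigma_{\max}(A^{T}A)+1$ to linearize the term $\|Ax+\sum_{j=1}^{k}B_{j}y_{j}^{t+1}-c\|^{2}$. In problem (1), not only is the noisy gradient of $f_{i}(x)$ unavailable, but the sample size $n$ is also very large. Thus, we propose the fast ZO-SVRG-ADMM and ZO-SAGA-ADMM methods to solve problem (1), based on SVRG and SAGA, respectively.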
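To illustrate why the choice $G=rI-\rho\eta A^{T}A$ is convenient, the sketch below works under the assumption that the gradient estimate is held fixed at its value $\hat{g}$ at $x_{t}$. Setting the gradient of the approximate function to zero then cancels the $A^{T}A$ terms and leaves the explicit step $x_{t+1}=x_{t}-\frac{\eta}{r}\big(\hat{g}-A^{T}\lambda_{t}+\rho A^{T}(Ax_{t}+\sum_{j}B_{j}y_{j}^{t+1}-c)\big)$, so no linear system in $A^{T}A$ has to be solved. The names `x_update` and `By` (standing for $\sum_{j}B_{j}y_{j}^{t+1}$) and the random test data are illustrative assumptions.

```python
import numpy as np

def x_update(x_t, g_hat, lam, A, By, c, eta, rho, r):
    """Closed-form x-step when G = r*I - rho*eta*A^T A (g_hat treated as fixed)."""
    residual = A @ x_t + By - c
    return x_t - (eta / r) * (g_hat - A.T @ lam + rho * (A.T @ residual))

# Random illustrative data (not from the paper).
rng = np.random.default_rng(0)
p, d = 3, 5
A = rng.standard_normal((p, d))
x_t, g_hat = rng.standard_normal(d), rng.standard_normal(d)
lam, By, c = rng.standard_normal(p), rng.standard_normal(p), rng.standard_normal(p)
eta, rho = 0.1, 2.0
r = rho * eta * np.linalg.eigvalsh(A.T @ A).max() + 1.5  # satisfies r > rho*eta*sigma_max + 1
G = r * np.eye(d) - rho * eta * (A.T @ A)

x_next = x_update(x_t, g_hat, lam, A, By, c, eta, rho, r)

# First-order optimality condition of the approximate function, evaluated at x_next.
kkt = g_hat + G @ (x_next - x_t) / eta - A.T @ lam + rho * A.T @ (A @ x_next + By - c)
```

The residual `kkt` vanishes up to rounding, which confirms that the explicit step is the exact minimizer of the quadratic model for this choice of $G$.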

Algorithm 1 shows the algorithmic framework of ZO-SVRG-ADMM. In Algorithm 1, we use the estimated stochastic gradient $\hat{g}_{t}^{s}=\hat{\nabla}f_{\mathcal{I}_{t}}(x_{t}^{s})-\hat{\nabla}f_{\mathcal{I}_{t}}(\tilde{x}^{s})+\hat{\nabla}f(\tilde{x}^{s})$ with $\hat{\nabla}f_{\mathcal{I}_{t}}(x^{s}_{t})=\frac{1}{b}\sum_{i_{t}\in\mathcal{I}_{t}}\hat{\nabla}f_{i_{t}}(x^{s}_{t})$. We have $\mathbb{E}_{\mathcal{I}_{t}}[\hat{g}_{t}^{s}]=\hat{\nabla}f(x_{t}^{s})\neq\nabla f(x_{t}^{s})$, i.e., this stochastic gradient is a biased estimate of the true full gradient. Although SVRG has shown great promise, it relies on the assumption that the stochastic gradient is an unbiased estimate of the true full gradient. Thus, adapting the ideas of SVRG to zeroth-order ADMM optimization is not a trivial task. To handle this challenge, we choose an appropriate step size $\eta$, penalty parameter $\rho$ and smoothing parameter $\mu$ to guarantee the convergence of our algorithms, as discussed in the following convergence analysis.
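The SVRG-style estimate $\hat{g}_{t}^{s}$ can be sketched as follows. This is only an illustration of the formula above, not the full Algorithm 1; the helper names, the scalar smoothing parameter and the quadratic component functions are assumptions chosen so that the result is easy to check.

```python
import numpy as np

def coord_grad(f, x, mu):
    """Coordinate-wise central-difference gradient estimate with a scalar mu."""
    d = x.size
    eye = np.eye(d)
    return np.array([(f(x + mu * eye[j]) - f(x - mu * eye[j])) / (2.0 * mu)
                     for j in range(d)])

def zo_svrg_gradient(fs, x, x_snap, snap_full_grad, batch, mu):
    """g_hat = grad_hat f_I(x) - grad_hat f_I(x_snap) + grad_hat f(x_snap)."""
    diffs = [coord_grad(fs[i], x, mu) - coord_grad(fs[i], x_snap, mu) for i in batch]
    return np.mean(diffs, axis=0) + snap_full_grad

# Illustrative components f_i(x) = 0.5 * ||x - a_i||^2 (an assumption for the check).
rng = np.random.default_rng(0)
n, d, b, mu = 8, 3, 4, 1e-3
targets = rng.standard_normal((n, d))
fs = [lambda x, a=a: 0.5 * np.sum((x - a) ** 2) for a in targets]

x_snap = np.zeros(d)  # snapshot point \tilde{x}^s
snap_full_grad = np.mean([coord_grad(f, x_snap, mu) for f in fs], axis=0)

x = rng.standard_normal(d)
batch = rng.choice(n, size=b, replace=False)  # mini-batch I_t
g_hat = zo_svrg_gradient(fs, x, x_snap, snap_full_grad, batch, mu)
```

For these quadratics the per-sample differences do not depend on the sampled index, so the correction term removes the mini-batch noise entirely and `g_hat` matches the full gradient $x-\frac{1}{n}\sum_{i}a_{i}$. For general $f_{i}$ the variance is only reduced, and the estimate remains biased through $\mu$.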

Algorithm 2 shows the algorithmic framework of ZO-SAGA-ADMM. In Algorithm 2, we use the estimated stochastic gradient $\hat{g}_{t}=\frac{1}{b}\sum_{i_{t}\in\mathcal{I}_{t}}\big{(}\hat{\nabla}f_{i_{t}}(x_{t})-\hat{\nabla}f_{i_{t}}(z^{t}_{i_{t}})\big{)}+\hat{\phi}_{t}$ with $\hat{\phi}_{t}=\frac{1}{n}\sum_{i=1}^{n}\hat{\nabla}f_{i}(z^{t}_{i})$ . Similarly, we have $\mathbb{E}_{\mathcal{I}_{t}}[\hat{g}_{t}]=\hat{\nabla}f(x_{t})\neq\nabla f(x_{t})$ .
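A corresponding sketch for the SAGA-style estimate $\hat{g}_{t}$ is given below. Algorithm 2 itself is not reproduced in this text, so the table refresh (storing the new estimates for the sampled indices and updating $\hat{\phi}_{t}$ incrementally) follows the standard SAGA scheme and is an assumption here, as are the helper names and the quadratic test functions.

```python
import numpy as np

def coord_grad(f, x, mu):
    """Coordinate-wise central-difference gradient estimate with a scalar mu."""
    d = x.size
    eye = np.eye(d)
    return np.array([(f(x + mu * eye[j]) - f(x - mu * eye[j])) / (2.0 * mu)
                     for j in range(d)])

def zo_saga_gradient(fs, x, table, phi, batch, mu):
    """Return (g_hat, new_table, new_phi).

    table[i] stores grad_hat f_i(z_i); phi is the mean of the table rows.
    batch must contain distinct indices.
    """
    fresh = np.array([coord_grad(fs[i], x, mu) for i in batch])
    g_hat = np.mean(fresh - table[batch], axis=0) + phi
    # Standard SAGA refresh (assumed): overwrite sampled rows, update the mean.
    new_table = table.copy()
    new_table[batch] = fresh
    new_phi = phi + (fresh - table[batch]).sum(axis=0) / len(fs)
    return g_hat, new_table, new_phi

# Illustrative components f_i(x) = 0.5 * ||x - a_i||^2 (an assumption for the check).
rng = np.random.default_rng(1)
n, d, b, mu = 8, 3, 4, 1e-3
targets = rng.standard_normal((n, d))
fs = [lambda x, a=a: 0.5 * np.sum((x - a) ** 2) for a in targets]

x0 = np.zeros(d)
table = np.array([coord_grad(f, x0, mu) for f in fs])  # z_i = x0 for all i
phi = table.mean(axis=0)

x = rng.standard_normal(d)
batch = rng.choice(n, size=b, replace=False)
g_hat, table, phi = zo_saga_gradient(fs, x, table, phi, batch, mu)
```

Compared with the SVRG variant, no periodic full pass over all $n$ components is needed after initialization, at the cost of storing one $d$-dimensional vector per component.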

5 Convergence Analysis

In this section, we will study the convergence properties of the proposed algorithms (ZO-SVRG-ADMM and ZO-SAGA-ADMM).

5.1 Convergence Analysis of ZO-SVRG-ADMM

In this subsection, we analyze convergence properties of the ZO-SVRG-ADMM.

Given the sequence $\{(x^{s}_{t},y_{[k]}^{s,t},\lambda^{s}_{t})_{t=1}^{m}\}_{s=1}^{S}$ generated from Algorithm 1, we define a Lyapunov function:

[TABLE]

where the positive sequence $\{c_{t}\}$ satisfies

$$c_{t}=\begin{cases}\frac{36L^{2}d}{\sigma^{A}_{\min}\rho b}+\frac{2Ld}{b}+(1+\beta)c_{t+1}, & 1\leq t\leq m,\\ 0, & t\geq m+1.\end{cases}$$

Next, we define a useful variable $\theta^{s}_{t}=\mathbb{E}\big[\|x^{s}_{t+1}-x^{s}_{t}\|^{2}+\|x^{s}_{t}-x^{s}_{t-1}\|^{2}+\frac{d}{b}(\|x^{s}_{t}-\tilde{x}^{s}\|^{2}+\|x^{s}_{t-1}-\tilde{x}^{s}\|^{2})+\sum_{j=1}^{k}\|y_{j}^{s,t}-y_{j}^{s,t+1}\|^{2}\big]$.

Theorem 1.

Suppose the sequence $\{(x^{s}_{t},y_{[k]}^{s,t},\lambda^{s}_{t})_{t=1}^{m}\}_{s=1}^{S}$ is generated from Algorithm 1. Let $m=n^{\frac{1}{3}}$ , $b=d^{1-l}n^{\frac{2}{3}},\ l\in\{0,\frac{1}{2},1\}$ , $\eta=\frac{\alpha\sigma_{\min}(G)}{9d^{l}L}\ (0<\alpha\leq 1)$ and $\rho=\frac{6\sqrt{71}\kappa_{G}d^{l}L}{\sigma^{A}_{\min}\alpha}$ , then we have

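$$\min_{s,t}\mathbb{E}\big[\mbox{dist}(0,\partial L(x^{s}_{t},y_{[k]}^{s,t},\lambda^{s}_{t}))^{2}\big]\leq O\Big(\frac{\tilde{\nu}d^{2l}}{T}\Big)+O(d^{2+2l}\mu^{2}),$$

where $\tilde{\nu}=R^{1}_{0}-R^{*}$, and $R^{*}$ is a lower bound of the function $R^{s}_{t}$. It follows that if the smoothing parameter $\mu$ and the total iteration number $T=mS$ satisfy

$$\frac{1}{\mu}=O\Big(\frac{d^{1+l}}{\sqrt{\epsilon}}\Big),\quad T=O\Big(\frac{\tilde{\nu}d^{2l}}{\epsilon}\Big),$$

then $(x^{s^{*}}_{t^{*}},y_{[k]}^{s^{*},t^{*}},\lambda^{s^{*}}_{t^{*}})$ is an $\epsilon$-approximate stationary point of problem (1), where $(t^{*},s^{*})=\mathop{\arg\min}_{t,s}\theta^{s}_{t}$.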

Remark 1.

Theorem 1 shows that given $m=n^{\frac{1}{3}}$, $b=d^{1-l}n^{\frac{2}{3}},\ l\in\{0,\frac{1}{2},1\}$, $\eta=\frac{\alpha\sigma_{\min}(G)}{9d^{l}L}\ (0<\alpha\leq 1)$, $\rho=\frac{6\sqrt{71}\kappa_{G}d^{l}L}{\sigma^{A}_{\min}\alpha}$ and $\mu=O(\frac{1}{d\sqrt{T}})$, the ZO-SVRG-ADMM has a convergence rate of $O(\frac{d^{2l}}{T})$. Specifically, when $1\leq d<n^{\frac{1}{3}}$, given $l=0$, the ZO-SVRG-ADMM has a convergence rate of $O(\frac{1}{T})$; when $n^{\frac{1}{3}}\leq d<n^{\frac{2}{3}}$, given $l=\frac{1}{2}$, it has a convergence rate of $O(\frac{d}{T})$; when $n^{\frac{2}{3}}\leq d$, given $l=1$, it has a convergence rate of $O(\frac{d^{2}}{T})$.
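As a small worked example of the parameter choices stated in Theorem 1, the helper below evaluates $m$, $b$, $\eta$ and $\rho$ for given problem constants. The function name, the rounding of $m$ and $b$ to integers, and the sample constants are assumptions for illustration; the theorem itself states the values only up to these formulas, and $\sigma_{\min}(G)$ and $\kappa_{G}$ are passed in as inputs because $G$ in turn depends on $\rho$ and $\eta$.

```python
import math

def zo_svrg_admm_params(n, d, l, L, sigma_min_G, kappa_G, sigma_min_A, alpha=1.0):
    """Evaluate the Theorem 1 formulas for m, b, eta and rho."""
    if l not in (0, 0.5, 1):
        raise ValueError("l must be 0, 1/2 or 1")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must lie in (0, 1]")
    m = max(1, round(n ** (1 / 3)))                    # inner-loop length n^(1/3)
    b = max(1, round(d ** (1 - l) * n ** (2 / 3)))     # mini-batch size d^(1-l) n^(2/3)
    eta = alpha * sigma_min_G / (9 * d ** l * L)       # step size
    rho = 6 * math.sqrt(71) * kappa_G * d ** l * L / (sigma_min_A * alpha)  # penalty
    return {"m": m, "b": b, "eta": eta, "rho": rho}

# Illustrative constants (assumed, not taken from the experiments).
params = zo_svrg_admm_params(n=1000, d=4, l=0, L=1.0,
                             sigma_min_G=1.0, kappa_G=1.0, sigma_min_A=1.0)
```

With these sample constants the inner loop has length 10 and the mini-batch size is 400, which shows how quickly the theoretical batch size grows with $d$ when $l=0$.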

5.2 Convergence Analysis of ZO-SAGA-ADMM

In this subsection, we provide the convergence analysis of the ZO-SAGA-ADMM.

Given the sequence $\{x_{t},y_{[k]}^{t},\lambda_{t}\}_{t=1}^{T}$ generated from Algorithm 2, we define a Lyapunov function

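$$\Omega_{t}=\mathbb{E}\Big[\mathcal{L}_{\rho}(x_{t},y_{[k]}^{t},\lambda_{t})+\Big(\frac{3\sigma^{2}_{\max}(G)}{\sigma^{A}_{\min}\rho\eta^{2}}+\frac{9L^{2}}{\sigma^{A}_{\min}\rho}\Big)\|x_{t}-x_{t-1}\|^{2}+\frac{18L^{2}d}{\sigma^{A}_{\min}\rho b}\frac{1}{n}\sum_{i=1}^{n}\|x_{t-1}-z^{t-1}_{i}\|^{2}+c_{t}\frac{1}{n}\sum_{i=1}^{n}\|x_{t}-z^{t}_{i}\|^{2}\Big].$$

Here the positive sequence $\{c_{t}\}$ satisfies

$$c_{t}=\begin{cases}\frac{36L^{2}d}{\sigma^{A}_{\min}\rho b}+\frac{2Ld}{b}+(1-\hat{p})(1+\beta)c_{t+1}, & 0\leq t\leq T-1,\\ 0, & t\geq T,\end{cases}$$

where $\hat{p}$ denotes the probability of an index $i$ being in $\mathcal{I}_{t}$. Next, we define a useful variable $\theta_{t}=\mathbb{E}\big[\|x_{t+1}-x_{t}\|^{2}+\|x_{t}-x_{t-1}\|^{2}+\frac{d}{bn}\sum^{n}_{i=1}(\|x_{t}-z^{t}_{i}\|^{2}+\|x_{t-1}-z^{t-1}_{i}\|^{2})+\sum_{j=1}^{k}\|y_{j}^{t}-y_{j}^{t+1}\|^{2}\big]$.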

Theorem 2.

Suppose the sequence $\{x_{t},y_{[k]}^{t},\lambda_{t}\}_{t=1}^{T}$ is generated from Algorithm 2. Let $b=n^{\frac{2}{3}}d^{\frac{1-l}{3}},\ l\in\{0,\frac{1}{2},1\}$ , $\eta=\frac{\alpha\sigma_{\min}(G)}{33d^{l}L}\ (0<\alpha\leq 1)$ and $\rho=\frac{6\sqrt{791}\kappa_{G}d^{l}L}{\sigma^{A}_{\min}\alpha}$ then we have

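$$\min_{1\leq t\leq T}\mathbb{E}\big[\mbox{dist}(0,\partial L(x_{t},y_{[k]}^{t},\lambda_{t}))^{2}\big]\leq O\Big(\frac{\tilde{\nu}d^{2l}}{T}\Big)+O(d^{2+2l}\mu^{2}),$$

where $\tilde{\nu}=\Omega_{0}-\Omega^{*}$, and $\Omega^{*}$ is a lower bound of the function $\Omega_{t}$. It follows that if the parameters $\mu$ and $T$ satisfy

$$\frac{1}{\mu}=O\Big(\frac{d^{1+l}}{\sqrt{\epsilon}}\Big),\quad T=O\Big(\frac{\tilde{\nu}d^{2l}}{\epsilon}\Big),$$

then $(x_{t^{*}},y_{[k]}^{t^{*}},\lambda_{t^{*}})$ is an $\epsilon$-approximate stationary point of problem (1), where $t^{*}=\mathop{\arg\min}_{1\leq t\leq T}\theta_{t}$.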

Remark 2.

Theorem 2 shows that given $b=n^{\frac{2}{3}}d^{\frac{1-l}{3}},\ l\in\{0,\frac{1}{2},1\}$, $\eta=\frac{\alpha\sigma_{\min}(G)}{33d^{l}L}\ (0<\alpha\leq 1)$, $\rho=\frac{6\sqrt{791}\kappa_{G}d^{l}L}{\sigma^{A}_{\min}\alpha}$ and $\mu=O(\frac{1}{d\sqrt{T}})$, the ZO-SAGA-ADMM has a convergence rate of $O(\frac{d^{2l}}{T})$. Specifically, when $1\leq d<n$, given $l=0$, the ZO-SAGA-ADMM has a convergence rate of $O(\frac{1}{T})$; when $n\leq d<n^{2}$, given $l=\frac{1}{2}$, it has a convergence rate of $O(\frac{d}{T})$; when $n^{2}\leq d$, given $l=1$, it has a convergence rate of $O(\frac{d^{2}}{T})$.

6 Experiments

In this section, we compare our algorithms (ZO-SVRG-ADMM, ZO-SAGA-ADMM) with the ZO-ProxSVRG, ZO-ProxSAGA Huang et al. (2019b), the deterministic zeroth-order ADMM (ZO-ADMM), and zeroth-order stochastic ADMM (ZO-SGD-ADMM) without variance reduction on two applications: 1) robust black-box binary classification, and 2) structured adversarial attacks on black-box DNNs.

6.1 Robust Black-Box Binary Classification

In this subsection, we focus on a robust black-box binary classification task with graph-guided fused lasso. Given a set of training samples $(a_{i},l_{i})_{i=1}^{n}$ , where $a_{i}\in\mathbb{R}^{d}$ and $l_{i}\in\{-1,+1\}$ , we find the optimal parameter $x\in\mathbb{R}^{d}$ by solving the problem:

$$\min_{x\in\mathbb{R}^{d}}\frac{1}{n}\sum_{i=1}^{n}f_{i}(x)+\tau_{1}\|x\|_{1}+\tau_{2}\|\hat{G}x\|_{1},$$

where $f_{i}(x)$ is the black-box loss function, which returns only the function value for a given input. Here, we specify the loss function $f_{i}(x)=\frac{\sigma^{2}}{2}\big(1-\exp(-\frac{(l_{i}-a_{i}^{T}x)^{2}}{\sigma^{2}})\big)$, which is the nonconvex robust correntropy induced loss He et al. (2011). The matrix $\hat{G}$ encodes the sparsity pattern of the graph obtained by learning a sparse Gaussian graphical model Huang and Chen (2015). In the experiment, we set the mini-batch size $b=20$, the smoothing parameter $\mu=\frac{1}{d\sqrt{t}}$ and the penalty parameters $\tau_{1}=\tau_{2}=10^{-5}$.
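For concreteness, the correntropy induced loss above can be evaluated as in the following sketch. The function name and the default $\sigma=1$ are assumptions (the value of $\sigma$ used in the experiments is not given in this text); in the black-box setting an optimizer would only ever see the returned scalar.

```python
import numpy as np

def correntropy_loss(x, a_i, l_i, sigma=1.0):
    """f_i(x) = sigma^2/2 * (1 - exp(-(l_i - a_i^T x)^2 / sigma^2))."""
    residual = l_i - a_i @ x
    return 0.5 * sigma ** 2 * (1.0 - np.exp(-(residual ** 2) / sigma ** 2))

a = np.array([1.0, 2.0])
loss_fit = correntropy_loss(np.zeros(2), a, l_i=0.0)          # zero residual
loss_outlier = correntropy_loss(np.zeros(2), a, l_i=100.0)    # very large residual
```

The loss is zero for a perfect fit and saturates at $\sigma^{2}/2$ for large residuals, which is what makes it robust to outliers and also what makes it nonconvex.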

In the experiment, we use some public real datasets (20news is from https://cs.nyu.edu/~roweis/data.html; the others are from www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/), which are summarized in Table 2. For each dataset, we use half of the samples as training data and the rest as testing data. Figure 1 shows that the objective values of our algorithms decrease faster than those of the other algorithms as the CPU time increases. In particular, our algorithms perform better than the zeroth-order proximal algorithms. These zeroth-order proximal methods have difficulty handling the nonsmooth penalties in the problem (6), so their proximal operators have to be solved by iterative methods.

6.2 Structured Attacks on Black-Box DNNs

In this subsection, we use our algorithms to generate adversarial examples to attack pre-trained DNN models whose parameters are hidden from us and whose outputs alone are accessible. Moreover, we consider an interesting problem: “What possible structures could adversarial perturbations have to fool black-box DNNs?” Thus, we use the zeroth-order algorithms to find a universal structured adversarial perturbation $x\in\mathbb{R}^{d}$ that could fool the samples $\{a_{i}\in\mathbb{R}^{d},\ l_{i}\in\mathbb{N}\}_{i=1}^{n}$, which can be regarded as the following problem:

[TABLE]

where $F(a)$ represents the final-layer output (before softmax) of the neural network, and $h(x)$ ensures the validity of the created adversarial examples. Specifically, $h(x)=0$ if $a_{i}+x\in[0,1]^{d}$ for all $i\in[n]$ and $\|x\|_{\infty}\leq\epsilon$ , otherwise $h(x)=\infty$ . Following Xu et al. (2018), we use the overlapping lasso to obtain structured perturbations. Here, the overlapping groups $\{\mathcal{G}_{p,q}\},\ p=1,\cdots,P,\ q=1,\cdots,Q$ are generated by dividing an image into sub-groups of pixels.

In the experiment, we use pre-trained DNN models on MNIST and CIFAR-10 as the target black-box models, which attain $99.4\%$ and $80.8\%$ test accuracy, respectively. For MNIST, we select 20 samples from a target class and set the batch size $b=4$ ; for CIFAR-10, we select 30 samples and set $b=5$ . We set $\mu=\frac{1}{d\sqrt{t}}$ , where $d=28\times 28$ and $d=3\times 32\times 32$ for MNIST and CIFAR-10, respectively. We also set the parameters $\epsilon=0.4$ , $\tau_{1}=1$ , $\tau_{2}=2$ and $\tau_{3}=1$ . For both datasets, the kernel size for the overlapping group lasso is set to $3\times 3$ and the stride is one.
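The group construction and the constraint function $h(x)$ can be sketched as below. This is an illustration under stated assumptions: the helper names are hypothetical, windows are taken on a single channel without padding, and the text does not say how borders or colour channels are handled.

```python
import numpy as np

def overlapping_groups(height, width, kernel=3, stride=1):
    """Flat pixel indices of every kernel x kernel window (single channel, no padding)."""
    idx = np.arange(height * width).reshape(height, width)
    groups = []
    for p in range(0, height - kernel + 1, stride):
        for q in range(0, width - kernel + 1, stride):
            groups.append(idx[p:p + kernel, q:q + kernel].ravel())
    return groups

def group_lasso_penalty(x, groups):
    """Sum of the l2 norms of x restricted to each (overlapping) group."""
    return sum(np.linalg.norm(x[g]) for g in groups)

def h_indicator(x, samples, eps=0.4):
    """h(x) = 0 if every a_i + x lies in [0, 1]^d and ||x||_inf <= eps, else infinity."""
    shifted = samples + x  # broadcasts x over the rows a_i
    feasible = (np.abs(x).max() <= eps) and shifted.min() >= 0.0 and shifted.max() <= 1.0
    return 0.0 if feasible else np.inf
```

Under these assumptions, a $28\times 28$ MNIST image with a $3\times 3$ kernel and stride one yields $26\times 26$ groups of nine pixels each.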

Figure 3 shows that the attack losses (i.e., the first term of problem (6.2)) of our methods decrease faster than those of the other methods as the number of iterations increases. Figure 2 shows that our algorithms can learn some structured perturbations and can successfully attack the corresponding DNNs.

7 Conclusions

In this paper, we proposed the fast ZO-SVRG-ADMM and ZO-SAGA-ADMM methods based on the coordinate smoothing gradient estimator, which uses only objective function values to optimize. Moreover, we proved that the proposed methods have a convergence rate of $O(\frac{1}{T})$ . In particular, our methods not only reach the best known convergence rate $O(\frac{1}{T})$ for nonconvex optimization, but can also effectively solve many machine learning problems with complex nonsmooth regularizations.

Acknowledgments

F.H., S.G., H.H. were partially supported by U.S. NSF IIS 1836945, IIS 1836938, DBI 1836866, IIS 1845666, IIS 1852606, IIS 1838627, IIS 1837956. S.C. was partially supported by the NSFC under Grant No. 61806093 and No. 61682281, and the Key Program of NSFC under Grant No. 61732006.

Appendix A Supplementary Materials

In this section, we study in detail the convergence properties of both the ZO-SVRG-ADMM and ZO-SAGA-ADMM algorithms.

Notations: To make the paper easier to follow, we give the following notations:

• $[k]=\{1,2,\cdots,k\}$ and $[j:k]=\{j,j+1,\cdots,k\}$ for all $1\leq j\leq k$ .

• $\|\cdot\|$ denotes the vector $\ell_{2}$ norm and the matrix spectral norm, respectively.

• $\|x\|_{G}=\sqrt{x^{T}Gx}$ , where $G$ is a positive definite matrix.

• $\sigma^{A}_{\min}$ and $\sigma^{A}_{\max}$ denote the minimum and maximum eigenvalues of $A^{T}A$ , respectively.

• $\sigma^{B_{j}}_{\max}$ denotes the maximum eigenvalue of $B_{j}^{T}B_{j}$ for all $j\in[k]$ , and $\sigma^{B}_{\max}=\max_{j=1}^{k}\sigma^{B_{j}}_{\max}$ .

• $\sigma_{\min}(G)$ and $\sigma_{\max}(G)$ denote the minimum and maximum eigenvalues of the matrix $G$ , respectively; the condition number is $\kappa_{G}=\frac{\sigma_{\max}(G)}{\sigma_{\min}(G)}$ .

• $\sigma_{\min}(H_{j})$ and $\sigma_{\max}(H_{j})$ denote the minimum and maximum eigenvalues of the matrix $H_{j}$ for all $j\in[k]$ , respectively; $\sigma_{\min}(H)=\min_{j=1}^{k}\sigma_{\min}(H_{j})$ and $\sigma_{\max}(H)=\max_{j=1}^{k}\sigma_{\max}(H_{j})$ .

• $\mu$ denotes the smoothing parameter of the gradient estimator.

• $\eta$ denotes the step size for updating the variable $x$ .

• $L$ denotes the Lipschitz constant of $\nabla f(x)$ .

• $b$ denotes the mini-batch size of the stochastic gradient.

• $T$ , $m$ and $S$ are the total number of iterations, the number of iterations in the inner loop, and the number of iterations in the outer loop, respectively.
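To make the roles of $\mu$ and $b$ concrete, the following is a minimal sketch of a coordinate-wise smoothing gradient estimator. It assumes the standard central-difference form with a single smoothing parameter shared by all coordinates, and sampling with replacement for the mini-batch; the function names are hypothetical and the paper's definition (4) should be consulted for the exact form.

```python
import numpy as np

def coord_grad_estimate(f, x, mu):
    """Coordinate-wise central-difference estimate of grad f(x), using 2d function queries."""
    d = x.size
    g = np.zeros(d)
    for j in range(d):
        e = np.zeros(d)
        e[j] = 1.0  # j-th standard basis vector
        g[j] = (f(x + mu * e) - f(x - mu * e)) / (2.0 * mu)
    return g

def minibatch_estimate(fs, x, mu, b, rng):
    """Average the estimate over b component functions sampled with replacement (assumed)."""
    batch = rng.choice(len(fs), size=b, replace=True)
    return np.mean([coord_grad_estimate(fs[i], x, mu) for i in batch], axis=0)
```

Only function values are queried, which is what makes the approach applicable to black-box losses; the price is $2d$ queries per component function.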

A.1 Theoretical Analysis of the ZO-SVRG-ADMM

In this subsection, we give the detailed convergence analysis of the ZO-SVRG-ADMM algorithm. First, we give some useful lemmas:

Lemma 1.

Suppose the sequence $\big\{(x^{s}_{t},y_{[k]}^{s,t},\lambda^{s}_{t})_{t=1}^{m}\big\}_{s=1}^{S}$ is generated by Algorithm 1. Then the following inequality holds:

[TABLE]

Proof.

Using the optimality condition of step 9 of Algorithm 1, we have

[TABLE]

By the step 11 of Algorithm 1, we have

[TABLE]

It follows that

[TABLE]

where $(A^{T})^{+}$ is the pseudoinverse of $A^{T}$ . By Assumption 4, i.e., $A$ is a full column rank matrix, we have $(A^{T})^{+}=A(A^{T}A)^{-1}$ . Then we have

[TABLE]

where $\sigma^{A}_{\min}$ denotes the minimum eigenvalue of $A^{T}A$ .
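The pseudoinverse identity and the resulting $1/\sigma^{A}_{\min}$ factor can be checked numerically. The sketch below uses a random tall matrix as a stand-in for $A$ (an assumption for illustration only; such a matrix has full column rank almost surely).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))            # stand-in for A with full column rank

pinv_AT = np.linalg.pinv(A.T)              # (A^T)^+
closed_form = A @ np.linalg.inv(A.T @ A)   # A (A^T A)^{-1}

# Largest eigenvalue of ((A^T)^+)^T (A^T)^+ versus 1 / sigma_min(A^T A)
sigma_min_A = np.linalg.eigvalsh(A.T @ A).min()
top_eig = np.linalg.eigvalsh(pinv_AT.T @ pinv_AT).max()
```

Here `pinv_AT` agrees with `closed_form`, and `top_eig` equals `1 / sigma_min_A` up to rounding, which is why $\|(A^{T})^{+}v\|^{2}\leq\frac{1}{\sigma^{A}_{\min}}\|v\|^{2}$ for any vector $v$.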

Next, considering the upper bound of $\|\hat{g}^{s}_{t}-\hat{g}^{s}_{t-1}\|^{2}$ , we have

[TABLE]

where the second inequality holds by Lemma 1 of Huang et al. [2019b] and the third inequality holds by Assumption 1. Finally, combining the two bounds above, we obtain the stated result. ∎

Lemma 2.

Suppose the sequence $\{(x^{s}_{t},y_{[k]}^{s,t},\lambda^{s}_{t})_{t=1}^{m}\}_{s=1}^{S}$ is generated from Algorithm 1, and define a Lyapunov function:

[TABLE]

where the positive sequence $\{c_{t}\}$ satisfies, for $s=1,2,\cdots,S$

[TABLE]

It follows that

[TABLE]

where $\gamma=\min(\sigma_{\min}^{H},L,\chi_{t})$ , $\chi_{t}\geq\frac{3\sqrt{71}\kappa_{G}Ld^{l}}{2\alpha}>0\ (l=0,0.5,1)$ , and $R^{*}$ denotes a lower bound of $R^{s}_{t}$ .

Proof.

By the optimality condition of step 8 in Algorithm 1, we have, for $j\in[k]$

[TABLE]

where the first inequality holds by the convexity of the function $\psi_{j}(y)$ , and the second equality follows by applying the identity $(a-b)^{T}b=\frac{1}{2}(\|a\|^{2}-\|b\|^{2}-\|a-b\|^{2})$ to the term $(B_{j}y_{j}^{s,t}-B_{j}y_{j}^{s,t+1})^{T}(Ax_{t}^{s}+\sum_{i=1}^{j}B_{i}y_{i}^{s,t+1}+\sum_{i=j+1}^{k}B_{i}y_{i}^{s,t}-c)$ . Thus, we have, for all $j\in[k]$

[TABLE]

Telescoping inequality (17) over $j$ from $1$ to $k$ , we obtain

[TABLE]

where $\sigma_{\min}^{H}=\min_{j\in[k]}\sigma_{\min}(H_{j})$ .
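For completeness, the identity $(a-b)^{T}b=\frac{1}{2}(\|a\|^{2}-\|b\|^{2}-\|a-b\|^{2})$ used in this proof follows by expanding the squared norm of the difference:

```latex
\begin{align*}
\|a\|^{2}-\|b\|^{2}-\|a-b\|^{2}
 &= \|a\|^{2}-\|b\|^{2}-\big(\|a\|^{2}-2a^{T}b+\|b\|^{2}\big) \\
 &= 2a^{T}b-2\|b\|^{2} \;=\; 2(a-b)^{T}b .
\end{align*}
```

Dividing both sides by two gives the stated form.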

By Assumption 1, we have

[TABLE]

Using the optimality condition of step 9 in Algorithm 1, we have

[TABLE]

Combining (19) and (20), we have

[TABLE]

where the equality $(i)$ holds by applying the equality $(a-b)^{T}b=\frac{1}{2}(\|a\|^{2}-\|b\|^{2}-\|a-b\|^{2})$ on the term $(Ax^{s}_{t}-Ax^{s}_{t+1})^{T}(Ax^{s}_{t+1}+\sum_{j=1}^{k}B_{j}y_{j}^{s,t+1}-c)$ , the inequality $(ii)$ holds by the inequality $a^{T}b\leq\frac{L}{2}\|a\|^{2}+\frac{1}{2L}\|b\|^{2}$ , and the inequality $(iii)$ holds by Lemma 1 of Huang et al. [2019b]. Thus, we obtain

[TABLE]

Using the step 10 in Algorithm 1, we have

[TABLE]

Combining (18), (22) and (A.1), we have

[TABLE]

Next, we define a Lyapunov function $R^{s}_{t}$ as follows:

[TABLE]

Considering the upper bound of $\|x^{s}_{t+1}-\tilde{x}^{s}\|^{2}$ , we have

[TABLE]

where the above inequality holds by the Cauchy-Schwarz and Young inequalities with $\beta>0$ . Combining (25) with (A.1), we obtain

[TABLE]

where $c_{t}=\frac{36L^{2}d}{\sigma^{A}_{\min}b\rho}+\frac{2Ld}{b}+(1+\beta)c_{t+1}$ and $\chi_{t}=\frac{\sigma_{\min}(G)}{\eta}+\frac{\rho\sigma^{A}_{\min}}{2}-L-\frac{6\sigma^{2}_{\max}(G)}{\sigma^{A}_{\min}\eta^{2}\rho}-\frac{9L^{2}}{\sigma^{A}_{\min}\rho}-(1+1/\beta)c_{t+1}$ .

Next, we will prove the relationship between $R^{s+1}_{1}$ and $R^{s}_{m}$ . Since $x^{s+1}_{0}=x^{s}_{m}=\tilde{x}^{s+1}$ , we have

[TABLE]

It follows that

[TABLE]

where the first inequality holds by the identity $\mathbb{E}\|\zeta-\mathbb{E}\zeta\|^{2}=\mathbb{E}\|\zeta\|^{2}-\|\mathbb{E}\zeta\|^{2}$ ; the third inequality holds by the definition of the zeroth-order gradient estimator (4).
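The variance decomposition invoked above is a one-line expansion, and it yields an inequality once the nonnegative term $\|\mathbb{E}\zeta\|^{2}$ is dropped:

```latex
\begin{align*}
\mathbb{E}\|\zeta-\mathbb{E}\zeta\|^{2}
 &= \mathbb{E}\|\zeta\|^{2}-2(\mathbb{E}\zeta)^{T}\mathbb{E}\zeta+\|\mathbb{E}\zeta\|^{2} \\
 &= \mathbb{E}\|\zeta\|^{2}-\|\mathbb{E}\zeta\|^{2}
 \;\leq\; \mathbb{E}\|\zeta\|^{2} .
\end{align*}
```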

By Lemma 1, we have

[TABLE]

Since $x^{s}_{m}=x^{s+1}_{0}$ , $y_{j}^{s,m}=y_{j}^{s+1,0}$ for all $j\in[k]$ and $\lambda^{s}_{m}=\lambda^{s+1}_{0}$ , by (18), we have

[TABLE]

By (22), we have

[TABLE]

By (A.1), we have

[TABLE]

where the second inequality holds by (A.1).

Combining (A.1), (32) with (A.1), we have

[TABLE]

Therefore, we have

[TABLE]

where $c_{m}=\frac{36L^{2}d}{\sigma^{A}_{\min}\rho b}+\frac{2Ld}{b}$ , and $\chi_{m}=\frac{\sigma_{\min}(G)}{\eta}+\frac{\rho\sigma^{A}_{\min}}{2}-L-\frac{6\sigma^{2}_{\max}(G)}{\sigma^{A}_{\min}\eta^{2}\rho}-\frac{9L^{2}}{\sigma^{A}_{\min}\rho}-c_{1}$ .

Letting $c_{m+1}=0$ and $\beta=\frac{1}{m}$ and recursing on $t$ , we have

[TABLE]

where the above inequality holds because $(1+\frac{1}{m})^{m}$ is increasing in $m$ and $\lim_{m\rightarrow\infty}(1+\frac{1}{m})^{m}=e$ . It follows that, for $t=1,2,\cdots,m$

[TABLE]

When $1\leq d<n^{\frac{1}{3}}$ , let $m=n^{\frac{1}{3}}$ , $b=dn^{\frac{2}{3}}$ (i.e., $b=d^{1-l}n^{\frac{2}{3}}\ l=0$ ) and $0<\eta\leq\frac{\sigma_{\min}(G)}{9L}$ , we have $T_{1}\geq 0$ . Further, let $\eta=\frac{\alpha\sigma_{\min}(G)}{9L}\ (0<\alpha\leq 1)$ and $\rho=\frac{6\sqrt{71}\kappa_{G}L}{\sigma^{A}_{\min}\alpha}$ , we have

[TABLE]

where the second inequality follows $\rho=\frac{6\sqrt{71}\kappa_{G}L}{\sigma^{A}_{\min}\alpha}$ . Thus, we have $\chi_{t}\geq\frac{3\sqrt{71}\kappa_{G}L}{2\alpha}>0$ for all $t\in\{1,2,\cdots,m\}$ .

When $n^{\frac{1}{3}}\leq d<n^{\frac{2}{3}}$ , let $m=n^{\frac{1}{3}}$ , $b=d^{\frac{1}{2}}n^{\frac{2}{3}}$ (i.e., $b=d^{1-l}n^{\frac{2}{3}}\ l=0.5$ ) and $0<\eta\leq\frac{\sigma_{\min}(G)}{9\sqrt{d}L}$ , we have $T_{1}\geq 0$ . Further, let $\eta=\frac{\alpha\sigma_{\min}(G)}{9\sqrt{d}L}\ (0<\alpha\leq 1)$ and $\rho=\frac{6\sqrt{71d}\kappa_{G}L}{\sigma^{A}_{\min}\alpha}$ , we have

[TABLE]

where the second equality follows by $\rho=\frac{6\sqrt{71d}\kappa_{G}L}{\sigma^{A}_{\min}\alpha}$ . Thus, we have $\chi_{t}\geq\frac{3\sqrt{71d}\kappa_{G}L}{2\alpha}>0$ .

When $n^{\frac{2}{3}}\leq d$ , let $m=n^{\frac{1}{3}}$ , $b=n^{\frac{2}{3}}$ (i.e., $b=d^{1-l}n^{\frac{2}{3}}\ l=1$ ) and $0<\eta\leq\frac{\sigma_{\min}(G)}{9dL}$ , we have $T_{1}\geq 0$ . Further, let $\eta=\frac{\alpha\sigma_{\min}(G)}{9dL}\ (0<\alpha\leq 1)$ and $\rho=\frac{6\sqrt{71}\kappa_{G}dL}{\sigma^{A}_{\min}\alpha}$ , we have

[TABLE]

where the second equality follows by $\rho=\frac{6\sqrt{71}\kappa_{G}dL}{\sigma^{A}_{\min}\alpha}$ . Thus, we have $\chi_{t}\geq\frac{3\sqrt{71}\kappa_{G}dL}{2\alpha}>0$ .
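The bound on $c_{t}$ that drives all three cases can be illustrated numerically. In the sketch below, `a` is a hypothetical placeholder for the constant $\frac{36L^{2}d}{\sigma^{A}_{\min}b\rho}+\frac{2Ld}{b}$, and the recursion $c_{t}=a+(1+\beta)c_{t+1}$ is unrolled backwards from $c_{m+1}=0$ with $\beta=\frac{1}{m}$; the numerical values are arbitrary.

```python
import math

def svrg_c_sequence(a, m):
    """Unroll c_t = a + (1 + 1/m) * c_{t+1} backwards from c_{m+1} = 0."""
    beta = 1.0 / m
    c = [0.0] * (m + 2)  # c[t] for t = 0..m+1, with c[m+1] = 0 (c[0] unused)
    for t in range(m, 0, -1):
        c[t] = a + (1.0 + beta) * c[t + 1]
    return c

a, m = 0.7, 50  # hypothetical values, for illustration only
c = svrg_c_sequence(a, m)
closed_form_c1 = a * m * ((1.0 + 1.0 / m) ** m - 1.0)  # geometric sum
upper_bound = a * m * (math.e - 1.0)                   # uses (1 + 1/m)^m < e
```

The sequence is largest at $t=1$, where it matches the geometric-sum closed form and stays below $(e-1)\,m\,a$, so controlling $c_{1}$ controls every $c_{t}$.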

By Assumption 4, i.e., $A$ is a full column rank matrix, we have $(A^{T})^{+}=A(A^{T}A)^{-1}$ . It follows that $\sigma_{\max}\big(((A^{T})^{+})^{T}(A^{T})^{+}\big)=\sigma_{\max}((A^{T}A)^{-1})=\frac{1}{\sigma_{\min}^{A}}$ . Since $\lambda^{s}_{t+1}=(A^{T})^{+}\big(\hat{g}^{s}_{t}+\frac{G}{\eta}(x^{s}_{t+1}-x^{s}_{t})\big)$ , we have

[TABLE]

where the first inequality is obtained by applying $\langle a,b\rangle\leq\frac{1}{2\beta}\|a\|^{2}+\frac{\beta}{2}\|b\|^{2}$ to the terms $\langle(A^{T})^{+}(\hat{\nabla}f(x_{t})-\nabla f(x_{t})),Ax_{t+1}+\sum_{j=1}^{k}B_{j}y_{j}^{t+1}-c\rangle$ , $\langle(A^{T})^{+}\nabla f(x_{t}),Ax_{t+1}+\sum_{j=1}^{k}B_{j}y_{j}^{t+1}-c\rangle$ and $\langle(A^{T})^{+}\frac{G}{\eta}(x_{t+1}-x_{t}),Ax_{t+1}+\sum_{j=1}^{k}B_{j}y_{j}^{t+1}-c\rangle$ with $\beta=\frac{\rho}{5}$ , respectively; the second inequality follows by Lemma 1 of Huang et al. [2019b] and Assumption 2. Using the definition of function $R^{s}_{t}$ and Assumption 3, we have

[TABLE]

Thus, the function $R^{s}_{t}$ is bounded from below. Let $R^{*}$ denote a lower bound of $R^{s}_{t}$ .

Finally, telescoping (A.2) and (A.1) over $t$ from [math] to $m-1$ and over $s$ from $1$ to $S$ , we have

[TABLE]

where $\gamma=\min(\sigma_{\min}^{H},L,\chi_{t})$ and $\chi_{t}\geq\frac{3\sqrt{71}\kappa_{G}Ld^{l}}{2\alpha}>0\ (l=0,0.5,1)$ .

∎

Next, based on the above lemmas, we give the convergence analysis of ZO-SVRG-ADMM algorithm. For notational simplicity, let

[TABLE]

Theorem 3.

Suppose the sequence $\{(x^{s}_{t},y_{[k]}^{s,t},\lambda^{s}_{t})_{t=1}^{m}\}_{s=1}^{S}$ is generated from Algorithm 1. Let $m=[n^{\frac{1}{3}}]$ , $b=[d^{1-l}n^{\frac{2}{3}}],\ l\in\{0,\frac{1}{2},1\}$ , $\eta=\frac{\alpha\sigma_{\min}(G)}{9d^{l}L}\ (0<\alpha\leq 1)$ and $\rho=\frac{6\sqrt{71}\kappa_{G}d^{l}L}{\sigma^{A}_{\min}\alpha}$ , then we have

[TABLE]

where $\gamma=\min(\sigma_{\min}^{H},\chi_{t},L)$ with $\chi_{t}\geq\frac{3\sqrt{71}\kappa_{G}d^{l}L}{2\alpha}$ , $\nu_{\max}=\max(\nu_{2},\nu_{3},\nu_{4})$ and $R^{*}$ is a lower bound of the function $R^{s}_{t}$ . Moreover, if the smoothing parameter $\mu$ and the total number of iterations $T=mS$ satisfy

[TABLE]

then $(x^{s^{*}}_{t^{*}},y_{[k]}^{s^{*},t^{*}},\lambda^{s^{*}}_{t^{*}})$ is an $\epsilon$ -approximate solution of (1), where $(t^{*},s^{*})=\mathop{\arg\min}_{t,s}\theta^{s}_{t}$ .

Proof.

First, we define a useful variable $\theta^{s}_{t}=\mathbb{E}\big{[}\|x^{s}_{t+1}-x^{s}_{t}\|^{2}+\|x^{s}_{t}-x^{s}_{t-1}\|^{2}+\frac{d}{b}(\|x^{s}_{t}-\tilde{x}^{s}\|^{2}+\|x^{s}_{t-1}-\tilde{x}^{s}\|^{2})+\sum_{j=1}^{k}\|y_{j}^{s,t}-y_{j}^{s,t+1}\|^{2}\big{]}$ . By the step 8 of Algorithm 1, we have, for all $i\in[k]$

[TABLE]

where the first inequality follows by the inequality $\|\frac{1}{n}\sum_{i=1}^{n}z_{i}\|^{2}\leq\frac{1}{n}\sum_{i=1}^{n}\|z_{i}\|^{2}$ .

By the step 9 of Algorithm 1, we have

[TABLE]

By the step 10 of Algorithm 1, we have

[TABLE]

Next, combining the three inequalities above, we have

[TABLE]

where the third inequality holds by Lemma 2, $\nu_{\max}=\max(\nu_{1},\nu_{2},\nu_{3})$ , $\gamma=\min(\sigma_{\min}^{H},\chi_{t},L)$ , and $\chi_{t}\geq\frac{3\sqrt{71}\kappa_{G}Ld^{l}}{2\alpha}>0\ (l=0,0.5,1)$ .

Given $\eta=\frac{\alpha\sigma_{\min}(G)}{9d^{l}L}\ (0<\alpha\leq 1)$ and $\rho=\frac{6\sqrt{71}\kappa_{G}Ld^{l}}{\sigma^{A}_{\min}\alpha}$ , it is easy to verify that $\gamma=O(1)$ and $\nu_{\max}=O(d^{2l})$ , where the hidden constants are independent of $n$ and $d$ . Thus, we obtain

[TABLE]

∎

A.2 Theoretical Analysis of the ZO-SAGA-ADMM

In this subsection, we give the detailed convergence analysis of the ZO-SAGA-ADMM algorithm. We begin with some useful lemmas:

Lemma 3.

Suppose the sequence $\{x_{t},y_{[k]}^{t},\lambda_{t}\}_{t=1}^{T}$ is generated by Algorithm 2. The following inequality holds

[TABLE]

Proof.

By the optimality condition of step 7 in Algorithm 2, we have

[TABLE]

Using the step 8 of Algorithm 2, then we have

[TABLE]

It follows that

[TABLE]

where $(A^{T})^{+}$ is the pseudoinverse of $A^{T}$ . By Assumption 4, i.e., $A$ is a full column rank matrix, we have $(A^{T})^{+}=A(A^{T}A)^{-1}$ . Then we have

[TABLE]

Next, considering the upper bound of $\|\hat{g}_{t}-\hat{g}_{t-1}\|^{2}$ , we have

[TABLE]

where the second inequality holds by Lemma 3 of Huang et al. [2019b], and the third inequality holds by Assumption 1.

Finally, combining the two inequalities above, we obtain the stated result. ∎

Lemma 4.

Suppose the sequence $\{x_{t},y_{[k]}^{t},\lambda_{t}\}_{t=1}^{T}$ is generated from Algorithm 2, and define a Lyapunov function

[TABLE]

where the positive sequence $\{c_{t}\}$ satisfies

[TABLE]

It follows that

[TABLE]

where $\gamma=\min(\sigma_{\min}^{H},L,\chi_{t})$ , $\chi_{t}\geq\frac{3\sqrt{791}\kappa_{G}d^{l}L}{2\alpha}\ (l=0,0.5,1)$ , and $\Omega^{*}$ denotes a lower bound of $\Omega_{t}$ .

Proof.

By the optimality condition of step 6 in Algorithm 2, we have, for $j\in[k]$

[TABLE]

where the first inequality holds by the convexity of the function $\psi_{j}(y)$ , and the second equality follows by applying the identity $(a-b)^{T}b=\frac{1}{2}(\|a\|^{2}-\|b\|^{2}-\|a-b\|^{2})$ to the term $(B_{j}y_{j}^{t}-B_{j}y_{j}^{t+1})^{T}(Ax_{t}+\sum_{i=1}^{j}B_{i}y_{i}^{t+1}+\sum_{i=j+1}^{k}B_{i}y_{i}^{t}-c)$ . Thus, we have, for all $j\in[k]$

[TABLE]

Telescoping inequality (57) over $j$ from $1$ to $k$ , we obtain

[TABLE]

where $\sigma_{\min}^{H}=\min_{j\in[k]}\sigma_{\min}(H_{j})$ .

By Assumption 1, we have

[TABLE]

Using the step 7 of Algorithm 2, we have

[TABLE]

Combining (59) and (60), we have

[TABLE]

where the equality $(i)$ holds by applying the identity $(a-b)^{T}b=\frac{1}{2}(\|a\|^{2}-\|b\|^{2}-\|a-b\|^{2})$ to the term $(Ax_{t}-Ax_{t+1})^{T}(Ax_{t+1}+\sum_{j=1}^{k}B_{j}y_{j}^{t+1}-c)$ ; the inequality $(ii)$ follows by the inequality $a^{T}b\leq\frac{L}{2}\|a\|^{2}+\frac{1}{2L}\|b\|^{2}$ ; and the inequality $(iii)$ holds by Lemma 3 of Huang et al. [2019b]. Thus, we obtain

[TABLE]

By the step 8 in Algorithm 2, we have

[TABLE]

Combining (58), (A.2) and (A.2), we have

[TABLE]

Next, we define a Lyapunov function as follows:

[TABLE]

By the step 9 of Algorithm 2, we have

[TABLE]

where $p$ denotes the probability of an index $i$ being in $\mathcal{I}_{t}$ . Here, we have

$$p=1-\Big(1-\frac{1}{n}\Big)^{b}\geq 1-\frac{1}{1+b/n}=\frac{b/n}{1+b/n}\geq\frac{b}{2n},$$

where the first inequality follows from $(1-a)^{b}\leq\frac{1}{1+ab}$ , and the second inequality holds by $b\leq n$ . Considering the upper bound of $\|x_{t+1}-z^{t}_{i}\|^{2}$ , we have

[TABLE]

where $\beta>0$ . Combining (A.2) with (A.2), we have

[TABLE]

It follows that

[TABLE]

where $c_{t}=\frac{36L^{2}d}{\sigma^{A}_{\min}\rho b}+\frac{2Ld}{b}+(1-p)(1+\beta)c_{t+1}$ .

Let $c_{T}=0$ and $\beta=\frac{b}{4n}$ . Since $(1-p)(1+\beta)=1+\beta-p-p\beta\leq 1+\beta-p$ and $p\geq\frac{b}{2n}$ , it follows that

[TABLE]

where $\theta=p-\beta\geq\frac{b}{4n}$ . Then recursing on $t$ , for $0\leq t\leq T-1$ , we have

[TABLE]

It follows that

[TABLE]

When $1\leq d<n$ , let $b=d^{\frac{1}{3}}n^{\frac{2}{3}}$ (i.e., $b=d^{\frac{1-l}{3}}n^{\frac{2}{3}},\ l=0$ ) and $0<\eta\leq\frac{\sigma_{\min}(G)}{33L}$ , we have $T_{1}\geq 0$ . Further, let $\eta=\frac{\alpha\sigma_{\min}(G)}{33L}\ (0<\alpha\leq 1)$ and $\rho=\frac{6\sqrt{791}\kappa_{G}L}{\sigma^{A}_{\min}\alpha}$ , we have

[TABLE]

Thus, we have $\chi_{t}\geq\frac{3\sqrt{791}\kappa_{G}L}{2\alpha}$ .

When $n\leq d<n^{2}$ , let $b=d^{\frac{1}{6}}n^{\frac{2}{3}}$ (i.e., $b=d^{\frac{1-l}{3}}n^{\frac{2}{3}},\ l=0.5$ ) and $0<\eta\leq\frac{\sigma_{\min}(G)}{33\sqrt{d}L}$ , we have $T_{1}\geq 0$ . Further, let $\eta=\frac{\alpha\sigma_{\min}(G)}{33\sqrt{d}L}\ (0<\alpha\leq 1)$ and $\rho=\frac{6\sqrt{791d}\kappa_{G}L}{\sigma^{A}_{\min}\alpha}$ , we have

[TABLE]

Thus, we have $\chi_{t}\geq\frac{3\sqrt{791d}\kappa_{G}L}{2\alpha}$ .

When $n^{2}\leq d$ , let $b=n^{\frac{2}{3}}$ (i.e., $b=d^{\frac{1-l}{3}}n^{\frac{2}{3}},\ l=1$ ) and $0<\eta\leq\frac{\sigma_{\min}(G)}{33dL}$ , we have $T_{1}\geq 0$ . Further, let $\eta=\frac{\alpha\sigma_{\min}(G)}{33dL}\ (0<\alpha\leq 1)$ and $\rho=\frac{6\sqrt{791}\kappa_{G}dL}{\sigma^{A}_{\min}\alpha}$ , we have

[TABLE]

Thus, we have $\chi_{t}\geq\frac{3\sqrt{791}\kappa_{G}dL}{2\alpha}$ .
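The two ingredients behind these three cases, the lower bound $p\geq\frac{b}{2n}$ and the boundedness of $c_{t}$, can be illustrated with a short numerical sketch. It assumes that the mini-batch $\mathcal{I}_{t}$ is drawn uniformly with replacement, so that $p=1-(1-\frac{1}{n})^{b}$; `a` is a hypothetical placeholder for $\frac{36L^{2}d}{\sigma^{A}_{\min}\rho b}+\frac{2Ld}{b}$, and the numerical values are arbitrary.

```python
def inclusion_probability(n, b):
    """P(index i appears among b uniform draws with replacement from n indices)."""
    return 1.0 - (1.0 - 1.0 / n) ** b

def saga_c_sequence(a, n, b, T):
    """Unroll c_t = a + (1 - p)(1 + beta) c_{t+1} backwards from c_T = 0, beta = b/(4n)."""
    p = inclusion_probability(n, b)
    beta = b / (4.0 * n)
    c = [0.0] * (T + 1)
    for t in range(T - 1, -1, -1):
        c[t] = a + (1.0 - p) * (1.0 + beta) * c[t + 1]
    return p, c

n, b, T, a = 1000, 100, 500, 0.3  # hypothetical values, for illustration only
p, c = saga_c_sequence(a, n, b, T)
```

Because $(1-p)(1+\beta)\leq 1-\theta$ with $\theta=p-\beta\geq\frac{b}{4n}$, every $c_{t}$ stays below $\frac{a}{\theta}\leq\frac{4na}{b}$ regardless of $T$.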

By Assumption 4, i.e., $A$ is a full column rank matrix, we have $(A^{T})^{+}=A(A^{T}A)^{-1}$ . It follows that $\sigma_{\max}\big(((A^{T})^{+})^{T}(A^{T})^{+}\big)=\sigma_{\max}((A^{T}A)^{-1})=\frac{1}{\sigma_{\min}^{A}}$ . Since $\lambda_{t+1}=(A^{T})^{+}\big(\hat{g}_{t}+\frac{G}{\eta}(x_{t+1}-x_{t})\big)$ , we have

[TABLE]

where the first inequality is obtained by applying $\langle a,b\rangle\leq\frac{1}{2\beta}\|a\|^{2}+\frac{\beta}{2}\|b\|^{2}$ to the terms $\langle(A^{T})^{+}(\hat{\nabla}f(x_{t})-\nabla f(x_{t})),Ax_{t+1}+\sum_{j=1}^{k}B_{j}y_{j}^{t+1}-c\rangle$ , $\langle(A^{T})^{+}\nabla f(x_{t}),Ax_{t+1}+\sum_{j=1}^{k}B_{j}y_{j}^{t+1}-c\rangle$ and $\langle(A^{T})^{+}\frac{G}{\eta}(x_{t+1}-x_{t}),Ax_{t+1}+\sum_{j=1}^{k}B_{j}y_{j}^{t+1}-c\rangle$ with $\beta=\frac{\rho}{5}$ , respectively; the second inequality follows by Lemma 3 of Huang et al. [2019b] and Assumption 2. By definition of the function $\Omega_{t}$ and Assumption 3, we have

[TABLE]

Thus, the function $\Omega_{t}$ is bounded from below. Let $\Omega^{*}$ denote a lower bound of $\Omega_{t}$ .

Finally, telescoping inequality (A.2) over $t$ from [math] to $T$ , we have

[TABLE]

where $\gamma=\min(\sigma_{\min}^{H},L,\chi_{t})$ and $\chi_{t}\geq\frac{3\sqrt{791}\kappa_{G}d^{l}L}{2\alpha}\ (l=0,0.5,1)$ .

∎

Next, based on the above lemmas, we give the convergence properties of the ZO-SAGA-ADMM algorithm. For notational simplicity, let

[TABLE]

Theorem 4.

Suppose the sequence $\{x_{t},y_{[k]}^{t},\lambda_{t}\}_{t=1}^{T}$ is generated from Algorithm 2. Let $b=n^{\frac{2}{3}}d^{\frac{1-l}{3}},\ l\in\{0,\frac{1}{2},1\}$ , $\eta=\frac{\alpha\sigma_{\min}(G)}{33d^{l}L}\ (0<\alpha\leq 1)$ and $\rho=\frac{6\sqrt{791}\kappa_{G}d^{l}L}{\sigma^{A}_{\min}\alpha}$ then we have

[TABLE]

where $\gamma=\min(\sigma_{\min}^{H},\chi_{t},L)$ with $\chi_{t}\geq\frac{3\sqrt{791}\kappa_{G}d^{l}L}{2\alpha}$ , $\nu_{\max}=\max(\nu_{2},\nu_{3},\nu_{4})$ and $\Omega^{*}$ is a lower bound of the function $\Omega_{t}$ . Moreover, if the parameters $\mu$ and $T$ satisfy

[TABLE]

then $(x_{t^{*}},y_{[k]}^{t^{*}},\lambda_{t^{*}})$ is an $\epsilon$ -approximate solution of (1), where $t^{*}=\mathop{\arg\min}_{1\leq t\leq T}\theta_{t}$ .

Proof.

We begin by defining a useful variable $\theta_{t}=\mathbb{E}\big[\|x_{t+1}-x_{t}\|^{2}+\|x_{t}-x_{t-1}\|^{2}+\frac{d}{bn}\sum^{n}_{i=1}(\|x_{t}-z^{t}_{i}\|^{2}+\|x_{t-1}-z^{t-1}_{i}\|^{2})+\sum_{j=1}^{k}\|y_{j}^{t}-y_{j}^{t+1}\|^{2}\big]$ . By the optimality condition of step 6 in Algorithm 2, we have, for all $i\in[k]$

[TABLE]

where the first inequality follows by the inequality $\|\sum_{i=1}^{r}\alpha_{i}\|^{2}\leq r\sum_{i=1}^{r}\|\alpha_{i}\|^{2}$ .

By the step 7 in Algorithm 2, we have

[TABLE]

By the step 8 of Algorithm 2, we have

[TABLE]

where the first inequality holds by Lemma 3.

Next, combining the three inequalities above, we have

[TABLE]

where the third inequality holds by Lemma 4, $\nu_{\max}=\max(\nu_{1},\nu_{2},\nu_{3})$ , $\gamma=\min(\sigma_{\min}^{H},\chi_{t},L)$ , and $\chi_{t}\geq\frac{3\sqrt{791}\kappa_{G}d^{l}L}{2\alpha}\ (l=0,0.5,1)$ .

Given $\eta=\frac{\alpha\sigma_{\min}(G)}{33d^{l}L}\ (0<\alpha\leq 1,\ l=0,0.5,1)$ and $\rho=\frac{6\sqrt{791}\kappa_{G}Ld^{l}}{\sigma^{A}_{\min}\alpha}$ , since $k$ is relatively small, it is easy to verify that $\gamma=O(1)$ and $\nu_{\max}=O(d^{2l})$ , where the hidden constants are independent of $n$ and $d$ . Thus, we obtain

[TABLE]

∎

Bibliography

Agarwal et al. [2010] Alekh Agarwal, Ofer Dekel, and Lin Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In COLT, pages 28–40. Citeseer, 2010.
Beck and Teboulle [2009] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
Boyd et al. [2011] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122, 2011.
Chen et al. [2017] Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Workshop on Artificial Intelligence and Security, pages 15–26. ACM, 2017.
Defazio et al. [2014] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS, pages 1646–1654, 2014.
Duchi et al. [2015] John C Duchi, Michael I Jordan, Martin J Wainwright, and Andre Wibisono. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE TIT, 61(5):2788–2806, 2015.
Gabay and Mercier [1976] Daniel Gabay and Bertrand Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1):17–40, 1976.
Gao et al. [2018] Xiang Gao, Bo Jiang, and Shuzhong Zhang. On the information-adaptive variants of the ADMM: an iteration complexity perspective. Journal of Scientific Computing, 76(1):327–363, 2018.
