Statistics with Set-Valued Functions: Applications to Inverse Approximate Optimization

Anil Aswani

arXiv:1702.00708·math.OC·January 9, 2018·Math. Program.

TL;DR

This paper develops a statistical framework for set-valued functions using variational analysis, enabling consistent estimation in inverse approximate optimization with noisy data.

Contribution

It introduces operational tools for statistics with set-valued functions and applies them to inverse approximate optimization, ensuring statistical consistency under noise.

Findings

1. Previous methods are statistically inconsistent with noisy data.

2. The proposed approach achieves consistency under mild conditions.

3. Applications include nonparametric estimation of set-valued functions.

Equations

\limsup_{n} C_{n} = \{x : \exists\, n_{k} \text{ s.t. } x_{n_{k}} \to x \text{ with } x_{n_{k}} \in C_{n_{k}}\},

\liminf_{n} C_{n} = \{x : \exists\, x_{n} \to x \text{ with } x_{n} \in C_{n}\}.

\begin{cases}\liminf_{n} f_{n}(x_{n}) \geq f(x) & \text{for every sequence } x_{n} \to x\\ \limsup_{n} f_{n}(x_{n}) \leq f(x) & \text{for some sequence } x_{n} \to x\end{cases}

\limsup_{S \to \overline{S}} G(S) = \{u : \exists\, S_{n} \to \overline{S} \text{ s.t. } u_{n} \to u \text{ with } S_{n} \subseteq V, u_{n} \in G(S_{n})\},

\liminf_{S \to \overline{S}} G(S) = \{u : \forall S_{n} \to \overline{S}, \exists\, u_{n} \to u \text{ with } S_{n} \subseteq V, u_{n} \in G(S_{n})\}.

F(x^{\prime})\subseteq F(x)+\kappa\|x^{\prime}-x\|\mathbb{B}\ \text{ for }x,x^{\prime}\in X,

G(S^{\prime})\subseteq G(S)+\kappa{\ooalign{$d$\cr$\mkern 6.8mul$}}_{\infty}(S^{\prime},S)\mathbb{B}\ \text{ for }S,S^{\prime}\subseteq V\text{ with }S,S^{\prime}\in\mathcal{K}.

\mathbb{E}(X)=\mathrm{cl}\{\mathbb{E}\xi:\xi\in S^{1}(X)\},

X_{n}=F_{in}\text{ with probability }p_{in},\text{ for }i=1,\ldots,n,\text{ with }\textstyle\sum_{i=1}^{n}p_{in}=1

\textstyle\limsup_{n}\mathbb{P}(r_{n}{\ooalign{$d$\cr$\mkern 6.8mul$}}_{\infty}(S(C_{n}),S(C))\geq u)\leq\mathbb{P}(\kappa\cdot w\geq u),

n\cdot\big((\textstyle\frac{1}{n}\bigoplus_{i=1}^{n}X_{i})\ominus\mathbb{E}(X)\big)\to N(0,\mathbb{E}(\xi\xi^{\mathsf{T}}))

\textstyle\lim_{n}\sum_{i=1}^{n}n^{-2}\cdot\mathrm{var}\big{(}\varphi_{h}(X_{i}-u)\big{)}\leq\lim_{n}c/\big{(}nh^{d}\big{)}<\infty,

\textstyle\mathbb{E}\big(\frac{1}{n}\sum_{i=1}^{n}\varphi_{h}(X_{i}-u)\big)

\textstyle=\int_{z\in\mathbb{B}\cap(U\oplus\{-u\})/h}\varphi_{1}(z)\cdot f_{X}(u+hz)\,dz

\displaystyle\textstyle\big{|}\mathbb{E}(\frac{1}{n}\sum_{i=1}^{n}\varphi_{h}(X_{i}-u))-f_{X}(u)\cdot\int_{z\in\mathbb{B}\cap T_{U}(u)}\varphi_{1}(z)dz\big{|}

\displaystyle\qquad\leq\textstyle\int_{z\in\mathbb{B}}\varphi_{1}(z)\cdot\big{|}f_{X}(u+hz)-f_{X}(u)\big{|}dz+\int_{z\in R(h)\cup S(h)}|\varphi_{1}(z)f_{X}(u+hz)|dz

\textstyle\leq\int_{z\in\mathbb{B}}\varphi_{1}(z)\cdot\kappa h\|z\|\,dz+s\int_{z\in R(h)}dz+s\int_{z\in S(h)}dz

\textstyle\leq h\cdot\kappa\int_{z\in\mathbb{B}}\varphi_{1}(z)\,dz+s\int_{z\in R(h)}dz+s\int_{z\in S(h)}dz

\textstyle\lim_{n}\sum_{i=1}^{n}n^{-2}\cdot\mathrm{var}\big{(}\varphi_{h}(X_{i}-u)\cdot\langle w_{i}\rangle_{j}\big{)}\leq\lim_{n}c\cdot\mathrm{var}\big{(}\langle w\rangle_{j}\big{)}/\big{(}nh^{d}\big{)}<\infty,

\textstyle\lim_{n}\sum_{i=1}^{n}n^{-2}\cdot\mathrm{var}\big{(}\varphi_{h}(X_{i}-u)\cdot\|X_{i}-u\|\big{)}\leq\lim_{n}c/\big{(}nh^{d}\big{)}<\infty,

\textstyle\mathbb{E}\big(\frac{1}{n}\sum_{i=1}^{n}\varphi_{h}(X_{i}-u)\cdot\|X_{i}-u\|\big)

\textstyle\leq h\int_{z\in\mathbb{R}^{d}}\varphi_{1}(z)\cdot z\cdot f_{X}(u+hz)\,dz

\leq c\cdot h=c\cdot n^{-1/(d+4)}

\widehat{S}(u)=\textstyle\Big{[}\frac{1}{n}\bigoplus_{i=1}^{n}\varphi_{h}(X_{i}-u)\cdot S_{i}\Big{]}\cdot\Big{[}\frac{1}{n}\sum_{i=1}^{n}\varphi_{h}(X_{i}-u)\Big{]}^{-1}

\textstyle\frac{1}{n}\bigoplus_{i=1}^{n}\varphi_{h}(X_{i}-u)\cdot\big{(}S(u)\oplus w_{i}\big{)}\subseteq\\ \textstyle\frac{1}{n}\bigoplus_{i=1}^{n}\varphi_{h}(X_{i}-u)\cdot\big{(}S_{i}\oplus\kappa\|X_{i}-u\|\mathbb{B}\big{)}\subseteq\\ \textstyle\frac{1}{n}\bigoplus_{i=1}^{n}\varphi_{h}(X_{i}-u)\cdot\big{(}S(u)\oplus 2\kappa\|X_{i}-u\|\mathbb{B}\oplus w_{i}\big{)}

\textstyle\frac{1}{n}\bigoplus_{i=1}^{n}\varphi_{h}(x_{i}-u)\cdot s_{i}=\Big{\{}\frac{1}{n}\bigoplus_{i=1}^{n}\varphi_{h}(x_{i}-u)\cdot t_{i}:t_{i}\in s_{i}\text{ for }i=1,\ldots,n\Big{\}},

\textstyle\frac{1}{n}\bigoplus_{i=1}^{n}\varphi_{h}(x_{i}-u)\cdot s_{i}=\bigoplus_{k=1}^{p}\big{[}\textstyle\frac{1}{n}\sum_{i=1}^{n}\varphi_{h}(x_{i}-u)\cdot w_{ik}\big{]}\cdot z_{k},

\textstyle\frac{1}{n}\bigoplus_{i=1}^{n}\varphi_{h}(x_{i}-u)\cdot s_{i}=\mathrm{co}\big{(}\big{\{}\frac{1}{n}\sum_{i=1}^{n}\varphi_{h}(x_{i}-u)\cdot v_{ij_{i}}:\\ \text{ for }j_{i}=1,\ldots,p_{i}\text{ and }i=1,\ldots,n\big{\}}\big{)}.

S(u)=\begin{cases}\big[-2,\ -\frac{2u+1}{u}+2\big],&\text{if }u\in\big[-2,-\frac{1}{4}\big]\\ \big[-2,\ 2\big],&\text{if }u\in\big[-\frac{1}{4},\frac{1}{4}\big]\\ \big[\frac{4u-1}{u}-2,\ 2\big],&\text{if }u\in\big[\frac{1}{4},2\big]\end{cases}

\textstyle V(u,\theta)=\min_{x}\big{\{}f(x,u,\theta)\ \big{|}\ g(x,u,\theta)\leq 0\big{\}}

\textstyle S(u,\epsilon,\theta)=\operatorname*{\epsilon-arg}\min_{x}\big{\{}f(x,u,\theta)\ \big{|}\ g(x,u,\theta)\leq 0\big{\}}=\\ \{x:f(x,u,\theta)\leq V(u,\theta)+\epsilon,g(x,u,\theta)\leq 0\}.

\hat{\epsilon}=\textstyle\max\Big{\{}\frac{1}{n}\sum_{i=1}^{n}\langle g(Y_{i})\rangle_{1}^{+},\frac{1}{n}\sum_{i=1}^{n}\langle g(Y_{i})\rangle_{2}^{+},\frac{1}{n}\sum_{i=1}^{n}|2Y_{i}+\langle\Lambda_{i}\rangle_{1}-\langle\Lambda_{i}\rangle_{2}|,\\ \textstyle\frac{1}{n}\sum_{i=1}^{n}|\langle\Lambda_{i}\rangle_{1}\cdot(Y_{i}-1)|,\frac{1}{n}\sum_{i=1}^{n}|\langle\Lambda_{i}\rangle_{2}\cdot(-Y_{i}-1)|\Big{\}}

\textstyle\min\big{\{}\frac{1}{n}\sum_{i=1}^{n}|2Y_{i}+\langle\lambda_{i}\rangle_{1}-\langle\lambda_{i}\rangle_{2}|+\langle\lambda_{i}\rangle_{1}\cdot|Y_{i}-1|+\\ \textstyle\langle\lambda_{i}\rangle_{2}\cdot|-Y_{i}-1|\ \big{|}\ \lambda_{i}\geq 0,i=1,\ldots,n\big{\}}.

Full text

A. Aswani

Industrial Engineering and Operations Research, University of California, Berkeley, CA, USA

Email: [email protected]

This work was supported in part by NSF Award CMMI-1450963.

Abstract

Much of statistics relies upon four key elements: a law of large numbers, a calculus to operationalize stochastic convergence, a central limit theorem, and a framework for constructing local approximations. These elements are well-understood for objects in a vector space (e.g., points or functions); however, much statistical theory does not directly translate to sets because they do not form a vector space. Building on probability theory for random sets, this paper uses variational analysis to develop operational tools for statistics with set-valued functions. These tools are first applied to nonparametric estimation (kernel regression of set-valued functions). The second application is to the problem of inverse approximate optimization, in which approximate solutions (corrupted by noise) to an optimization problem are observed and then used to estimate the amount of suboptimality of the solutions and the parameters of the optimization problem that generated the solutions. We show that previous approaches to this problem are statistically inconsistent when the data is corrupted by noise, whereas our approach is consistent under mild conditions.

Keywords: set-valued functions, statistics, inverse optimization

Journal: Mathematical Programming

1 Introduction

While statistical theory is well-developed for problems concerning (single-valued) functions bickel2006 ; van2000 , there has been less work on statistics with sets or set-valued functions. Most attention in statistics on sets has been focused on the problem of estimating a single set under different measurement models devroye1980 ; geffroy1964 ; guntuboyina2012 ; korostelev1995 ; patschkowski2016 ; renyi1963 ; scholkopf2001 . The problem of estimating set-valued functions is less well studied, though it has potential applications in varied domains including healthcare, robotics, and energy. For instance, we study in this paper the problem of inverse approximate optimization, where approximate solutions (corrupted by noise) to a parametric optimization problem are observed and then used to estimate the amount of suboptimality of the solutions and the parameters that generated the solutions. Inverse approximate optimization can be used to construct predictive models of human behavior and decision-making, where the explicit model is that an individual makes decisions by approximately solving an optimization problem. Statistical estimation in this context could be used to quantify the tradeoffs made by a particular individual between competing objectives, as well as quantify the predictability of the decision-making process. This particular problem of inverse approximate optimization is related to the broader topic of statistics with set-valued functions because the solution mapping of an (even strictly convex) optimization problem becomes a set when suboptimality of solutions is allowed. Thus a framework for statistics with set-valued functions is needed to study such problems.

A substantial impediment to studying such estimation problems is the lack of statistical tools for random sets and set-valued functions, and two technical issues prevent the use of existing tools. The first is that most statistical theory assumes objects belong to a vector space, which is the case for points and functions. But sets do not form a vector space, and so existing statistical theory cannot be used. This is a fundamental difficulty, and even the usual notion of expectation does not apply to sets molchanov2006 . The second is that most statistical theory has been developed by using metrics and distance functions to derive results. But analyzing sets using distances is difficult, and most analysis tools and results for sets do not use this approach berge1963 ; rockafellar2009 .

Arguably the most natural approach to statistics with random sets is to define a family of sets parametrized by a random vector, and then perform standard statistical analysis with respect to this parametrization. However, it is not clear without further analysis whether stochastic convergence of the estimated parameters implies stochastic convergence of the corresponding set estimates. We study this question in a more general framework and give a counterexample to demonstrate how parameter convergence does not always imply set convergence. Moreover, the parametrization approach does not lead to a useful definition for the expectation of random sets molchanov2006 ; the reason is that the expectation of the parameters does not characterize the expectation of the set in a way that ensures the law of large numbers holds.

One goal of this paper is to establish tools for statistics with set-valued functions, and this requires understanding four main ingredients: a law of large numbers, a calculus to operationalize stochastic convergence, a central limit theorem, and tools for constructing local approximations. Probability theory for random sets molchanov2006 provides an expectation for random sets aumann1965 ; kudo1953 , a law of large numbers artstein1975 , and a central limit theorem weil1982 . Here we use variational analysis rockafellar2009 to advocate a notion of local approximation for set-valued functions, and to develop results that allow us to interpret stochastic convergence and expectations of random sets as operators.

The paper begins by describing our notation and providing some useful definitions related to set-valued functions. We focus in this paper on almost sure (a.s.) convergence because the corresponding definitions and approach most clearly demonstrate the tight link between variational analysis and statistics. Defining set convergence in probability requires metrization, which partially obscures the relationship to variational analysis. We also focus on Lipschitz continuity for set-valued functions because we advocate using this concept as a notion of local approximation for set-valued functions. The utility of this approach is displayed later in the paper when we use Lipschitz continuity as a replacement for differentiability when proving a Delta method-like result and proving statistical consistency of a kernel regression estimator.

The next section shows how to interpret stochastic convergence and expectation of random sets as operators. We study the limit of sequences of sets under different set operations, after proving a set-based generalization of the continuous mapping theorem bickel2006 from statistics. Then we study the expectation of random sets under various set operations. Standard proofs about the properties of the expectation of random variables do not extend because the expectation of a random set cannot be computed by integration. This means that properties such as Jensen’s inequality, or the way the expectation distributes over the product of a random matrix with an independent random set, have not been previously established, and we prove such results. We conclude by reviewing a law of large numbers and a central limit theorem for random sets.

Another goal of this paper is to study two problems of estimating set-valued functions, and through the process of analyzing these problems we demonstrate the utility of our tools for statistics with set-valued functions. The first problem we study is estimating a set-valued function using noisy measurements of the set. We propose a kernel regression estimator that can be interpreted as a generalization of methods for functions aswani2011 ; aswani2013 ; bickel1982 ; noda1976 ; wand1994 . The key step in proving statistical consistency is using Lipschitz continuity of the set-valued function to construct local approximations. We show that statistical consistency follows by combining our results on stochastic convergence with convergence bounds on (vector-valued) random variables.
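To make the kernel regression idea concrete, here is a minimal sketch for the special case of interval-valued observations on the real line. It follows the form of the estimator listed among the paper's equations, $\widehat{S}(u)=[\frac{1}{n}\bigoplus_{i}\varphi_{h}(X_{i}-u)\cdot S_{i}]\cdot[\frac{1}{n}\sum_{i}\varphi_{h}(X_{i}-u)]^{-1}$. The Epanechnikov kernel, the scaling $\varphi_{h}(x)=\varphi_{1}(x/h)/h$, the function names, and the toy map $S(x)=[x-1,x+1]$ are all assumptions made for illustration, not choices taken from the paper.

```python
def epanechnikov(z):
    """Kernel supported on the unit ball (an assumed choice of phi_1)."""
    return 0.75 * (1.0 - z * z) if abs(z) <= 1.0 else 0.0

def kernel_interval_regression(u, xs, intervals, h):
    """Estimate S(u) from pairs (x_i, [lo_i, hi_i]) with bandwidth h.

    With nonnegative weights, the Minkowski-weighted sum of intervals is the
    interval of weighted endpoint sums, so the set estimator reduces to two
    weighted averages, one per endpoint.
    """
    weights = [epanechnikov((x - u) / h) / h for x in xs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no observations within bandwidth of u")
    lo = sum(w * s[0] for w, s in zip(weights, intervals)) / total
    hi = sum(w * s[1] for w, s in zip(weights, intervals)) / total
    return (lo, hi)

# noiseless toy data from the hypothetical map S(x) = [x - 1, x + 1]
xs = [i / 50.0 for i in range(-100, 101)]
intervals = [(x - 1.0, x + 1.0) for x in xs]
print(kernel_interval_regression(0.0, xs, intervals, h=0.2))
```

On this symmetric, noiseless grid the estimate at $u=0$ is close to $[-1,1]$. General convex sets would need a richer representation (for example vertices or support functions) than the endpoint pairs used here.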

The second problem we study is inverse approximate optimization, where noisy measurements of approximate solutions to an optimization problem are used to estimate the suboptimality of the solutions and the parameters of the optimization problem. In contrast, past work on inverse optimization assumes no noise ahuja2001 ; chan2014 ; esfahani2015 or exact solutions aswani2015 ; bertsimas2015 ; keshavarz2011 . We develop a method for inverse approximate optimization and prove its statistical consistency using stochastic epi-convergence aswani2011 ; dupacova1988 ; geyer1994 ; knight2000 ; lachout2005 ; salinetti1986 . Combining with our results on stochastic convergence and results on the continuity of solutions to optimization problems rockafellar2009 ; royset2016b shows our method consistently estimates the (set-valued) approximate solution mapping that generates the data.
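The reason approximate solutions lead to sets can be seen in a small sketch of the approximate solution mapping $\{x:f(x)\leq V+\epsilon,\ g(x)\leq 0\}$. The objective $f(x)=x^{2}$, the constraint $-1\leq x\leq 1$, the grid, and the function name are hypothetical choices for illustration, and the dependence on $u$ and $\theta$ is dropped. The problem is strictly convex with unique minimizer $0$, yet its $\epsilon$-approximate solutions form the interval $[-\sqrt{\epsilon},\sqrt{\epsilon}]$ when $\epsilon\leq 1$.

```python
import math

def eps_argmin_grid(f, feasible, grid, eps):
    """Grid approximation of {x : f(x) <= V + eps, x feasible}."""
    pts = [x for x in grid if feasible(x)]
    V = min(f(x) for x in pts)            # optimal value on the grid
    return [x for x in pts if f(x) <= V + eps]

f = lambda x: x * x                        # hypothetical objective
feasible = lambda x: -1.0 <= x <= 1.0      # hypothetical constraint set
grid = [i / 1000.0 for i in range(-2000, 2001)]

sol = eps_argmin_grid(f, feasible, grid, eps=0.25)
print(min(sol), max(sol), math.sqrt(0.25))
```

Setting `eps=0` recovers the single exact minimizer on the grid, which is the case covered by work that assumes exact solutions.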

We conclude by examining extensions of the problem of inverse approximate optimization, as well as discussing related open questions about statistics with set-valued functions. In particular, we describe how some extensions lead to formulations of optimization problems with structures (e.g., objective functions that are integrals whose domain of integration depends on the decision variable) that have not been well-studied from the perspective of numerical optimization. Performing statistics with sets and set-valued functions also leads to questions about the design of numerical representations of sets. We argue that further study of statistics with set-valued functions will require developing new numerical methods and optimization theory.

2 Preliminaries

This section presents the notation used in this paper, as well as several useful concepts from variational analysis. Most of the variational analysis definitions are from rockafellar2009 . The definition of set-valued set functions is from matheron1975 , and we use the definitions of the Minkowski set operations from schneider1993 . We abbreviate almost surely using a.s.

2.1 Notation

Let $\mathcal{F}(E)$ be the space of closed subsets of $E$ , and let $\mathcal{K}(E)$ be the space of compact subsets of $E$ . We will focus on cases where $E$ is a Euclidean space, and so will use the notation $\mathcal{F},\mathcal{K}$ to refer to the corresponding spaces. Clearly $\mathcal{F}\supset\mathcal{K}$ by definition.

Suppose $C,D$ are sets and $\Psi$ is a matrix or scalar. We use the set notation: $C\cup D$ is the union of $C,D$ ; $C\cap D$ is the intersection of $C,D$ ; $C\subseteq D$ denotes that $C$ is a subset of $D$ ; $C\supseteq D$ denotes that $C$ is a superset of $D$ ; $\mathrm{cl}(C)$ is the closure of $C$ ; $\mathrm{co}(C)$ is the convex hull of $C$ ; $C^{\ \mathsf{c}}$ is the complement of $C$ ; $\partial C$ is the boundary of $C$ ; $C\oplus D=\{c+d:c\in C,d\in D\}$ is the Minkowski sum of $C,D$ ; $C\ominus D=\{x:x\oplus D\subseteq C\}$ is the Minkowski difference of $C,D$ ; $\Psi\cdot C=\{\Psi\cdot c:c\in C\}$ ; and $\Psi^{-1}C=\{\Psi^{-1}\cdot c:c\in C\}$ .
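As a small illustration of the Minkowski operations just defined, the following sketch computes them for finite sets of real numbers and for closed intervals represented as `(lo, hi)` tuples. The helper names are hypothetical and not part of the paper.

```python
from itertools import product

def minkowski_sum(C, D):
    """C ⊕ D = {c + d : c in C, d in D} for finite sets of real numbers."""
    return {c + d for c, d in product(C, D)}

def interval_sum(C, D):
    """Minkowski sum of closed intervals [c0, c1] ⊕ [d0, d1]."""
    return (C[0] + D[0], C[1] + D[1])

def interval_difference(C, D):
    """Minkowski difference C ⊖ D = {x : x ⊕ D ⊆ C} of closed intervals.

    x + D fits inside C exactly when c0 - d0 <= x <= c1 - d1.
    Returns None when no translate of D fits inside C (empty set).
    """
    lo, hi = C[0] - D[0], C[1] - D[1]
    return (lo, hi) if lo <= hi else None

print(minkowski_sum({0, 1}, {10, 20}))
print(interval_sum((0, 1), (-1, 1)))
print(interval_difference((0, 3), (-1, 1)))
print(interval_difference((0, 1), (-1, 1)))
```

The last call returns `None` because $[-1,1]$ is wider than $[0,1]$. This shows that the Minkowski difference does not act as an inverse of the Minkowski sum, which is one symptom of sets not forming a vector space.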

2.2 Limit Definitions and Set-Valued Mappings

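The outer limit of the sequence of sets $C_{n}$ is defined as

$$\limsup_{n}C_{n}=\{x:\exists\,n_{k}\text{ s.t. }x_{n_{k}}\rightarrow x\text{ with }x_{n_{k}}\in C_{n_{k}}\},$$

and the inner limit of the sequence of sets $C_{n}$ is defined as

$$\liminf_{n}C_{n}=\{x:\exists\,x_{n}\rightarrow x\text{ with }x_{n}\in C_{n}\}.$$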

The outer limit consists of all the cluster points of $C_{n}$ , whereas the inner limit consists of all limit points of $C_{n}$ . The limit of the sequence of sets $C_{n}$ exists if the outer and inner limits are equal, and we define that $\textstyle\lim_{n}C_{n}:=\limsup_{n}C_{n}=\liminf_{n}C_{n}$ .
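One way to get a feel for these definitions is the distance characterization from variational analysis (a standard fact that is not stated in this section): $x$ lies in the inner limit when $d(x,C_{n})\rightarrow 0$, and in the outer limit when $\liminf_{n}d(x,C_{n})=0$. The sketch below applies this to the alternating sequence $C_{n}=\{(-1)^{n}\}$, whose outer limit is $\{-1,1\}$ and whose inner limit is empty. Only a finite tail of the sequence is inspected, so this is a numerical heuristic and not a proof; the function names, tail range, and tolerance are illustrative assumptions.

```python
def dist(x, C):
    """d(x, C) = min over y in C of |x - y| for a finite set C of reals."""
    return min(abs(x - y) for y in C)

def C(n):
    """Alternating sequence of singleton sets C_n = {(-1)^n}."""
    return {(-1) ** n}

def tail_distances(x, start=1000, stop=1100):
    """Distances d(x, C_n) over a finite tail of the sequence."""
    return [dist(x, C(n)) for n in range(start, stop)]

def in_outer_limit(x, tol=1e-9):
    # heuristic for liminf_n d(x, C_n) = 0: some tail distance is ~0
    return min(tail_distances(x)) <= tol

def in_inner_limit(x, tol=1e-9):
    # heuristic for lim_n d(x, C_n) = 0: all tail distances are ~0
    return max(tail_distances(x)) <= tol

for x in (-1, 1):
    print(x, in_outer_limit(x), in_inner_limit(x))
```

Both points are cluster points of the sequence but neither is a limit point, so the outer and inner limits differ and the set limit does not exist.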

Let $\overline{\mathbb{R}}=[-\infty,\infty]$ denote the extended real line. A sequence of extended-real-valued functions $f_{n}:X\rightarrow\overline{\mathbb{R}}$ is said to epi-converge to $f$ if at each $x\in X$ we have

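$$\begin{cases}\liminf_{n}f_{n}(x_{n})\geq f(x)&\text{for every sequence }x_{n}\rightarrow x\\ \limsup_{n}f_{n}(x_{n})\leq f(x)&\text{for some sequence }x_{n}\rightarrow x\end{cases}$$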

The notion of epi-convergence is so-named because it is equivalent to set convergence of the epigraphs of $f_{n}$ , meaning that epi-convergence is equivalent to the condition $\lim_{n}\{(x,\alpha)\in X\times\mathbb{R}:f_{n}(x)\leq\alpha\}=\{(x,\alpha)\in X\times\mathbb{R}:f(x)\leq\alpha\}$ .

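A set-valued set function $G:V\Rightarrow U$ assigns to each set $S\subseteq V$ a set $G(S)\subseteq U$ . The outer limit of $G$ at the set $\overline{S}\subseteq V$ is defined as

$$\limsup_{S\rightarrow\overline{S}}G(S)=\{u:\exists\,S_{n}\rightarrow\overline{S}\text{ s.t. }u_{n}\rightarrow u\text{ with }S_{n}\subseteq V,u_{n}\in G(S_{n})\},$$

and the inner limit of $G$ at the set $\overline{S}\subseteq V$ is defined as

$$\liminf_{S\rightarrow\overline{S}}G(S)=\{u:\forall S_{n}\rightarrow\overline{S},\exists\,u_{n}\rightarrow u\text{ with }S_{n}\subseteq V,u_{n}\in G(S_{n})\}.$$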

The intuition is similar to the notions for sequences of sets. The set-valued set function $G$ is outer semicontinuous (osc) at $\overline{S}$ if $\limsup_{S\rightarrow\overline{S}}G(S)\subseteq G(\overline{S})$ , and $G$ is inner semicontinuous (isc) at $\overline{S}$ if $\liminf_{S\rightarrow\overline{S}}G(S)\supseteq G(\overline{S})$ . The set-valued set function $G$ is continuous at $\overline{S}$ when it is both osc and isc, that is when $\lim_{S\rightarrow\overline{S}}G(S)=G(\overline{S})$ .

Variational analysis typically uses set-valued functions, rather than set-valued set functions. A set-valued function $F:X\rightrightarrows U$ assigns to each point $x\in X$ a set $F(x)\subseteq U$ . Outer limits, inner limits, outer semicontinuity, inner semicontinuity, and continuity are defined as above but with points replacing sets in the domain. Moreover, a set-valued function applied pointwise to sets is an osc, isc, continuous set-valued set function whenever the set-valued function is osc, isc, continuous, respectively.

2.3 Probability Definitions and Stochastic Convergence

Let $(\Omega,\mathfrak{F},\mathbb{P})$ be a complete probability space, where $\Omega$ is the sample space, $\mathfrak{F}$ is the set of events, and $\mathbb{P}$ is the probability measure. A map $S:\Omega\rightarrow\mathcal{F}$ is a random set if $\{\omega:S(\omega)\in\mathcal{X}\}\in\mathfrak{F}$ for each $\mathcal{X}$ in the Borel $\sigma$ -algebra on $\mathcal{F}$ molchanov2006 . Like the usual convention for random variables, we notationally drop the argument for a random set.

When discussing samples for estimation, we use the convention that capital letters denote random variables, and lowercase letters denote measured data. Also, we use the notation $U(a,b)$ to specify a uniform distribution with support $[a,b]$ .

We next define almost sure stochastic convergence of random sets. The notation $\operatorname*{as-lim\,sup}_{n}C_{n}\subseteq C$ denotes $\mathbb{P}(\limsup_{n}C_{n}\subseteq C)=1$ , the notation $\operatorname*{as-lim\,inf}_{n}C_{n}\supseteq C$ denotes $\mathbb{P}(\liminf_{n}C_{n}\supseteq C)=1$ , and the notation $\operatorname*{as-lim}_{n}C_{n}=C$ denotes $\mathbb{P}(\lim_{n}C_{n}=C)=1$ . Note $\operatorname*{as-lim\,sup}_{n}C_{n}\subseteq C$ and $\operatorname*{as-lim\,inf}_{n}C_{n}\supseteq C$ if and only if $\operatorname*{as-lim}_{n}C_{n}=C$ , since a countable intersection of almost sure events occurs almost surely.

2.4 Distances and Lipschitz Continuity

Let $d(x,C)=\inf_{y\in C}\|x-y\|$ and $d^{2}(x,C)=\inf_{y\in C}\|x-y\|^{2}$ be the distance and squared distance, respectively, from a point $x$ to set $C$ . The support function of $C$ is $h(x,C)=\sup_{y\in C}x^{\mathsf{T}}y$ . We also define the indicator function $\delta(x,C)$ to equal $0$ when $x\in C$ and $+\infty$ when $x\notin C$ . The (integrated) set distance between $C$ and $D$ is defined as ${\ooalign{$ d $\cr$ \mkern 6.8mul $}}(C,D)=\int_{0}^{\infty}{\ooalign{$ d $\cr$ \mkern 6.8mul $}}_{r}(C,D)e^{-r}dr$ , where the pseudo-distance between sets $C$ and $D$ is given by ${\ooalign{$ d $\cr$ \mkern 6.8mul $}}_{r}(C,D)=\max_{\|x\|\leq r}\big{|}d(x,C)-d(x,D)\big{|}$ . Note ${\ooalign{$ d $\cr$ \mkern 6.8mul $}}_{r}(\{x\},C)\neq d(x,C)$ for all $r$ . The integrated set distance ${\ooalign{$ d $\cr$ \mkern 6.8mul $}}$ is a metric that characterizes the convergence defined earlier for sets in $\mathcal{F}$ , and the Pompeiu-Hausdorff distance ${\ooalign{$ d $\cr$ \mkern 6.8mul $}}_{\infty}$ is a metric that characterizes the convergence defined earlier for sets in $\mathcal{K}$ . Since these metrics are complex, the sequence characterization of convergence is arguably more natural for sets.
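For finite point sets the quantities above can be computed directly. The sketch below evaluates the point-to-set distance, the support function, and the Pompeiu-Hausdorff distance, using the usual two-sided excess formula for the latter (the larger of the two one-sided worst-case distances). The function names and example sets are illustrative assumptions.

```python
import math

def point_to_set(x, C):
    """d(x, C) = min_{y in C} ||x - y|| for a finite set C of points."""
    return min(math.dist(x, y) for y in C)

def support(x, C):
    """Support function h(x, C) = max_{y in C} x^T y for a finite set C."""
    return max(sum(xi * yi for xi, yi in zip(x, y)) for y in C)

def hausdorff(C, D):
    """Pompeiu-Hausdorff distance between finite nonempty sets C and D."""
    excess_cd = max(point_to_set(c, D) for c in C)
    excess_dc = max(point_to_set(d, C) for d in D)
    return max(excess_cd, excess_dc)

square = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
shifted = [(x + 3.0, y) for x, y in square]
print(hausdorff(square, shifted))
print(hausdorff(square, [(0.0, 0.0)]))
print(support((1.0, 1.0), square))
```

Translating the four corner points by 3 gives a distance of 3, while comparing them with the single point at the origin gives the distance to the farthest corner.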

One exception to this statement is in defining Lipschitz continuity for set-valued functions. A set-valued function $F:X\rightrightarrows U$ is Lipschitz continuous on $X$ with constant $\kappa\in\mathbb{R}_{+}$ if it is nonempty, closed-valued and such that

$$F(x^{\prime})\subseteq F(x)+\kappa\|x^{\prime}-x\|\mathbb{B}\ \text{ for }x,x^{\prime}\in X,$$

where $\mathbb{B}=\{u:\|u\|\leq 1\}$ is the unit ball. A set-valued set function $G:V\Rightarrow U$ is Lipschitz continuous on $V$ with constant $\kappa\in\mathbb{R}_{+}$ if it is nonempty, closed-valued and

$$G(S^{\prime})\subseteq G(S)+\kappa{\ooalign{$d$\cr$\mkern 6.8mul$}}_{\infty}(S^{\prime},S)\mathbb{B}\ \text{ for }S,S^{\prime}\subseteq V\text{ with }S,S^{\prime}\in\mathcal{K}.$$

We will make use of Lipschitz continuity as a zeroth-order local approximation.
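The Lipschitz inclusion can be checked numerically for interval-valued maps on the real line, where inflating $F(x)$ by $\kappa\|x^{\prime}-x\|\mathbb{B}$ just widens both endpoints. The map $F(x)=[\sin(x)-1,|x|+1]$, the grid, and the function names below are hypothetical choices for illustration; both endpoints are 1-Lipschitz, so the inclusion should hold with $\kappa=1$ and fail for a smaller constant. A grid check can refute a constant but cannot prove it.

```python
import math

def F(x):
    """Interval-valued map F(x) = [sin(x) - 1, |x| + 1] as (lo, hi)."""
    return (math.sin(x) - 1.0, abs(x) + 1.0)

def lipschitz_inclusion_holds(F, xs, kappa, tol=1e-12):
    """Check F(x') ⊆ F(x) + kappa*|x' - x|*B for all grid pairs.

    For intervals this means the inflated interval
    [lo(x) - r, hi(x) + r], r = kappa*|x' - x|, contains [lo(x'), hi(x')].
    """
    for x in xs:
        lo, hi = F(x)
        for xp in xs:
            r = kappa * abs(xp - x)
            lo_p, hi_p = F(xp)
            if lo_p < lo - r - tol or hi_p > hi + r + tol:
                return False
    return True

grid = [i / 10.0 for i in range(-30, 31)]
print(lipschitz_inclusion_holds(F, grid, kappa=1.0))
print(lipschitz_inclusion_holds(F, grid, kappa=0.5))
```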

3 Mathematical Tools for Statistics with Set-Valued Functions

This section develops mathematical tools that allow us to interpret stochastic convergence and the expectation of random sets as operators. We prove results on the limit of sequences of sets under different set operations, define an expectation for random sets, and then derive results about the behavior of this expectation under different set operations. We conclude this section by briefly summarizing a law of large numbers and a central limit theorem for random sets.

3.1 Stochastic Limit Theorems

Our reason for considering set-valued set functions is that this allows us to more precisely generalize the classical continuous mapping theorem of statistics bickel2006 to mappings applied to sequences of sets. Because semicontinuity is an important aspect of set convergence, a generalization that considers semicontinuity leads to a richer set of results than simply considering continuity.

Theorem 3.1 (Semicontinuous Mapping Theorem)

Let $G$ be a set-valued set function, and suppose $\operatorname*{as-lim}_{n}C_{n}=C$ . There are three cases:

1. If $G$ is osc at $C$ , then $\operatorname*{as-lim\,sup}_{n}G(C_{n})\subseteq G(C)$ .

2. If $G$ is isc at $C$ , then $\operatorname*{as-lim\,inf}_{n}G(C_{n})\supseteq G(C)$ .

3. If $G$ is continuous at $C$ , then $\operatorname*{as-lim}_{n}G(C_{n})=G(C)$ .

Proof

The definition of osc (isc) means $\lim_{n}C_{n}=C$ implies $\limsup_{n}G(C_{n})\subseteq G(C)$ ( $\liminf_{n}G(C_{n})\supseteq G(C)$ ). This means $\mathbb{P}(\limsup_{n}G(C_{n})\subseteq G(C))\geq\mathbb{P}(\lim_{n}C_{n}=C)=1$ ( $\mathbb{P}(\liminf_{n}G(C_{n})\supseteq G(C))\geq\mathbb{P}(\lim_{n}C_{n}=C)=1$ ), which shows the first two cases. The third case follows from the first two cases by recalling that continuity at $C$ is equivalent to being both osc and isc at $C$ . ∎

Remark 1

One consequence is that the set-valued function $S(\theta)$ parametrized by $\theta$ has the behavior that $\operatorname*{as-lim}_{n}\theta_{n}=\theta_{0}$ implies $\operatorname*{as-lim}_{n}S(\theta_{n})=S(\theta_{0})$ only when the set is continuous with respect to the parametrization. For example, consider $S(\theta)=\{1\}$ if $\theta>0$ , $S(\theta)=\{-1\}$ if $\theta<0$ , and $S(\theta)=[-1,1]$ if $\theta=0$ . If $\theta_{n}=1/n$ , then $S(\theta_{n})\equiv\{1\}$ and so $\operatorname*{as-lim}_{n}S(\theta_{n})=\{1\}$ . But $\operatorname*{as-lim}_{n}\theta_{n}=0$ and $S(0)=[-1,1]$ .
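The counterexample in this remark can be reproduced in a few lines. The sketch represents each set as an interval `(lo, hi)` and measures the gap between $S(\theta_{n})$ and $S(0)$ with the Pompeiu-Hausdorff distance, which for compact intervals of the real line is the larger of the two endpoint differences. The function names are illustrative.

```python
def S(theta):
    """Set-valued map from Remark 1, returned as an interval (lo, hi)."""
    if theta > 0:
        return (1.0, 1.0)      # the singleton {1}
    if theta < 0:
        return (-1.0, -1.0)    # the singleton {-1}
    return (-1.0, 1.0)         # the interval [-1, 1]

def hausdorff_interval(A, B):
    """Pompeiu-Hausdorff distance between compact intervals of the real line."""
    return max(abs(A[0] - B[0]), abs(A[1] - B[1]))

thetas = [1.0 / n for n in range(1, 10001)]
gaps = [hausdorff_interval(S(t), S(0.0)) for t in thetas]
print(thetas[-1], gaps[-1])
```

The parameter gets arbitrarily close to $0$ while the distance between $S(\theta_{n})$ and $S(0)$ stays at 2, so parameter convergence does not carry over to set convergence here.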

As is customary in statistics, we immediately get some useful corollaries to our semicontinuous mapping theorem by applying the theorem to specific mappings. Our first corollary applies the semicontinuous mapping theorem to set operations like unions and intersections of sets, the boundary of sets, the convex hull of sets, etc.

Corollary 1

Let $C_{n},D_{n}\in\mathcal{F}$ be almost surely convergent sequences of sets (i.e., $\operatorname*{as-lim}_{n}C_{n}=C$ and $\operatorname*{as-lim}_{n}D_{n}=D$ ). Then we have:

1. $\operatorname*{as-lim}_{n}(C_{n}\cup D_{n})=C\cup D$

2. $\operatorname*{as-lim\,sup}_{n}(C_{n}\cap D_{n})\subseteq C\cap D$

3. $\operatorname*{as-lim\,inf}_{n}\mathrm{cl}(C_{n}^{\ \mathsf{c}})\supseteq\mathrm{cl}(C^{\mathsf{c}})$

4. $\operatorname*{as-lim\,inf}_{n}\partial C_{n}\supseteq\partial C$

5. $\operatorname*{as-lim\,inf}_{n}\mathrm{co}(C_{n})\supseteq\mathrm{co}(C)$

6. $\operatorname*{as-lim}_{n}\mathrm{co}(C_{n})=\mathrm{co}(C)$ , when there is a deterministic $C_{0}\in\mathcal{K}$ so $C_{n}\subseteq C_{0}$ a.s.

Proof

We interpret $\cup,\cap$ as set-valued set functions with a domain over the product space $\mathcal{F}\times\mathcal{F}$ : The function $G_{1}(S,T)=S\cup T$ is continuous matheron1975 , and the function $G_{2}(S,T)=S\cap T$ is osc matheron1975 . The set complement and boundary operators can be interpreted as set-valued set functions with domain $\mathcal{F}$ : The function $G_{3}(S)=\mathrm{cl}(S^{\mathsf{c}})$ is isc matheron1975 , and the function $G_{4}(S)=\partial S$ is isc matheron1975 . The convex hull operation can be cast as set-valued set functions: $G_{5}(S)=\mathrm{co}(S)$ is isc when the domain is $\mathcal{F}$ , and $G_{6}(S)=\mathrm{co}(S)$ is continuous when the domain is $C_{0}$ matheron1975 . The results now follow from the corresponding parts of the semicontinuous mapping theorem. ∎
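To see why the intersection only yields an inclusion rather than an equality, consider a deterministic example with closed intervals. The sketch below is illustrative only: the interval encoding, the helper `intersect`, and the particular sequences `C_n`, `D_n` are assumptions chosen for this example.

```python
def intersect(a, b):
    """Intersection of two closed intervals (lo, hi); None encodes the empty set."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None

def C_n(n):
    return (0.0, 1.0)              # constant sequence, limit C = [0, 1]

def D_n(n):
    return (1.0 + 1.0 / n, 2.0)    # limit D = [1, 2]

C, D = (0.0, 1.0), (1.0, 2.0)

# Every intersection along the sequence is empty ...
all_empty = all(intersect(C_n(n), D_n(n)) is None for n in range(1, 1001))

# ... but the intersection of the limits is the singleton {1}
limit_intersection = intersect(C, D)
```

Here the outer limit of $C_{n}\cap D_{n}$ is empty while $C\cap D=\{1\}$, so the inclusion for intersections can be strict.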

Remark 2

Note the above result states that the stochastic limit of the convex hull operator is sensitive to the domain of the sequence of sets.

We can also apply the semicontinuous mapping theorem to the Minkowski set operations. These results are useful for proving convergence of statistical estimators.

Corollary 2

Let $C_{n},D_{n}\in\mathcal{F}$ be almost surely convergent sequences of sets (i.e., $\operatorname*{as-lim}_{n}C_{n}=C$ and $\operatorname*{as-lim}_{n}D_{n}=D$ ), and let $\Psi_{n}$ be an almost surely convergent (in the Frobenius norm) sequence of matrices or scalars (i.e., $\operatorname*{as-lim}_{n}\Psi_{n}=\Psi$ ). If there exists a deterministic $D_{0}\in\mathcal{K}$ so $D_{n}\subseteq D_{0}$ a.s., then

(a) $\operatorname*{as-lim}_{n}(C_{n}\oplus D_{n})=C\oplus D$

(b) $\operatorname*{as-lim\,sup}_{n}(C_{n}\ominus D_{n})\subseteq C\ominus D$ , when $D\neq\emptyset$

(c) $\operatorname*{as-lim}_{n}\Psi_{n}\cdot D_{n}=\Psi\cdot D$

(d) $\operatorname*{as-lim}_{n}\Psi^{-1}_{n}D_{n}=\Psi^{-1}D$ , when $\Psi$ is invertible

Proof

We interpret $\oplus,\ominus$ as set-valued set functions with a domain over the product space $\mathcal{F}\times\mathcal{K}$ : The function $G_{1}(S,T)=S\oplus T$ is continuous matheron1975 , and the function $G_{2}(S,T)=S\ominus T$ is osc if $T\neq\emptyset$ matheron1975 . So the first two results follow from Theorem 3.1. The multiplication operation can be interpreted as a set-valued set function $G_{3}(S,T)=T\cdot S$ with domain over the product space $\mathbb{M}\times D_{0}$ , where $\mathbb{M}$ is the space of matrices of appropriate dimension or the space of scalars. We show it is continuous. Suppose $G_{3}$ is not osc at $\overline{S}\times\overline{T}$ ; then there exist $T_{n}\rightarrow\overline{T}$ , $S_{n}\rightarrow\overline{S}$ , and $u_{n}\rightarrow\overline{u}$ with $T_{n}\in\mathbb{M}$ , $S_{n}\in\mathcal{F}$ , $u_{n}\in T_{n}\cdot S_{n}$ , and $\overline{u}\notin\overline{T}\cdot\overline{S}$ . But by the definition of matrix-set (or scalar-set) multiplication there exists $v_{n}\in S_{n}$ with $u_{n}=T_{n}\cdot v_{n}$ , and by the boundedness by assumption of $D_{0}$ there exist $n_{k}$ and $\overline{v}$ such that $v_{n_{k}}\rightarrow\overline{v}$ with $\overline{v}\in\overline{S}$ , which is a contradiction since matrix-vector (or scalar-vector) multiplication is osc. Thus $G_{3}$ is osc. Next, we show $G_{3}$ is isc at $\overline{T}\cdot\overline{S}$ : Consider any $\overline{x}\in\overline{S}$ and $u=\overline{T}\cdot\overline{x}$ , and let $T_{n},S_{n}$ be any sequences satisfying $T_{n}\rightarrow\overline{T}$ and $S_{n}\rightarrow\overline{S}$ . By the inner limit definition there exists $x_{n}\rightarrow\overline{x}$ with $x_{n}\in S_{n}$ , and so $T_{n}\cdot x_{n}\rightarrow\overline{T}\cdot\overline{x}$ with $T_{n}\cdot x_{n}\in T_{n}\cdot S_{n}$ . So $G_{3}$ satisfies the definition of being isc at $\overline{T}\cdot\overline{S}$ , and is continuous since it is also osc. The third result follows from Theorem 3.1. 
The fourth result is proved by noting Theorem 3.1 implies $\operatorname*{as-lim}_{n}\Psi_{n}^{-1}=\Psi^{-1}$ since the matrix inverse operation is continuous except at points of singularity, and so $\operatorname*{as-lim}_{n}\Psi^{-1}_{n}D_{n}=\Psi^{-1}D$ by the third result. ∎

Our final results on stochastic limits are not based on the semicontinuous mapping theorem, but are nevertheless useful for writing stochastic convergence proofs.

Lemma 1 (Sandwich Lemma)

Let $L_{n}\in\mathcal{F}$ and $U_{n}\in\mathcal{F}$ be almost surely convergent sequences of sets (i.e., $\operatorname*{as-lim}_{n}L_{n}=L$ and $\operatorname*{as-lim}_{n}U_{n}=U$ ), and let $C_{n}\in\mathcal{F}$ be a sequence of sets. Then we have

(a) $\operatorname*{as-lim\,sup}_{n}C_{n}\subseteq U$ , when $C_{n}\subseteq U_{n}$ a.s.

(b) $\operatorname*{as-lim\,inf}_{n}C_{n}\supseteq L$ , when $C_{n}\supseteq L_{n}$ a.s.

(c) $\operatorname*{as-lim}_{n}C_{n}=L=U$ , when $L_{n}\subseteq C_{n}\subseteq U_{n}$ a.s. and $L=U$

Proof

For the first two results, note $\operatorname*{as-lim\,sup}_{n}C_{n}\subseteq\operatorname*{as-lim\,sup}_{n}U_{n}=\operatorname*{as-lim}_{n}U_{n}=U$ and $\operatorname*{as-lim\,inf}_{n}C_{n}\supseteq\operatorname*{as-lim\,inf}_{n}L_{n}=\operatorname*{as-lim}_{n}L_{n}=L$ . The third result follows from the first two results and the definition of limit. ∎

This sandwich lemma is valuable for statistical analysis, and we next present a convergence result that is helpful in proving statistical consistency.

Corollary 3

Let $C_{n},D_{n}\in\mathcal{F}$ be sequences of sets, with $D_{n}\subseteq r_{n}\mathbb{B}$ for a sequence $r_{n}\in\mathbb{R}_{+}$ . If $\operatorname*{as-lim}_{n}r_{n}=0$ and $\operatorname*{as-lim}_{n}C_{n}\oplus D_{n}$ exists, then $\operatorname*{as-lim}_{n}C_{n}=\operatorname*{as-lim}_{n}C_{n}\oplus D_{n}$ .

Proof

Consider any $\overline{c}\in\operatorname*{as-lim\,sup}_{n}C_{n}$ , and note that by the outer limit definition there exist $n_{k}$ and $c_{n_{k}}\in C_{n_{k}}$ such that $c_{n_{k}}\rightarrow\overline{c}$ . Thus $c_{n_{k}}+d_{n_{k}}\rightarrow\overline{c}$ for any $d_{n}\in D_{n}$ since by assumption $d_{n}\rightarrow 0$ . This means $\operatorname*{as-lim\,sup}_{n}C_{n}\subseteq\operatorname*{as-lim\,sup}_{n}C_{n}\oplus D_{n}=\operatorname*{as-lim}_{n}C_{n}\oplus D_{n}$ , where the equality holds since $\operatorname*{as-lim}_{n}C_{n}\oplus D_{n}$ exists. Next, choose any $\overline{u}\in\operatorname*{as-lim}_{n}C_{n}\oplus D_{n}$ . By the inner limit definition there exists $u_{n}\in C_{n}\oplus D_{n}$ such that $u_{n}\rightarrow\overline{u}$ , and so by the Minkowski sum definition there exist $c_{n}\in C_{n}$ and $d_{n}\in D_{n}$ such that $u_{n}=c_{n}+d_{n}$ or equivalently that $c_{n}=u_{n}-d_{n}$ . Since by assumption $d_{n}\rightarrow 0$ , this means $c_{n}\rightarrow\overline{u}$ . Thus $\operatorname*{as-lim\,inf}_{n}C_{n}\supseteq\operatorname*{as-lim}_{n}C_{n}\oplus D_{n}$ . The result follows by noting $\operatorname*{as-lim\,inf}_{n}C_{n}\subseteq\operatorname*{as-lim\,sup}_{n}C_{n}$ always holds, and combining with the above.∎

3.2 Expectation

Because sets do not form a vector space, defining expectations for random sets is not straightforward. In fact, a number of different definitions have been proposed molchanov2006 that capture different features that might be desired for an expectation operation. One particularly useful definition is the selection expectation kudo1953 ; aumann1965 . This definition for the expectation of random sets is the most well studied because it leads to a corresponding law of large numbers and central limit theorem molchanov2006 .

For a random set $X$ , a selection $\xi$ is a (single-valued) random vector that almost surely belongs to $X$ . We say the selection $\xi$ is integrable if $\mathbb{E}\|\xi\|_{1}$ is finite, where $\|\cdot\|_{1}$ is the usual $\ell_{1}$ -norm. The selection expectation of a random set $X$ is defined as

$$\mathbb{E}(X)=\mathrm{cl}\{\mathbb{E}(\xi):\xi\in\mathcal{S}^{1}(X)\},$$

where $\mathcal{S}^{1}(X)$ is the set of all integrable selections of $X$ . The random set $X$ is called integrable if $\mathcal{S}^{1}(X)\neq\emptyset$ , and note this property implies $X$ is almost surely non-empty.

The selection expectation is difficult to use because it cannot be computed by taking an integral, as is the case for expectations for objects in a vector space. But since we assume $E$ is Euclidean space, the definition of the selection expectation simplifies and has a sharp characterization molchanov2006 : If the probability space is nonatomic and $X$ is a bounded and closed integrable random set, then $\mathbb{E}(X)=\{\mathbb{E}\xi:\xi\in\mathcal{S}^{1}(X)\}$ is a compact set, $\mathbb{E}(X)$ is convex, $\mathbb{E}(X)=\mathbb{E}(\mathrm{co}(X))$ , and $h(u,\mathbb{E}(X))=\mathbb{E}(h(u,X))$ for all $u\in E$ , where $h$ is the support function. This support function characterization is powerful, and allows us to prove several properties about the selection expectation. More importantly, the following results allow us to operationalize the selection expectation, which is useful from a practical standpoint for performing statistical analysis.

Proposition 1

Suppose $C,D$ are bounded and closed integrable random sets, and let $\Psi$ be a random matrix or a random scalar. If the probability space is nonatomic, then

(a) $\mathbb{E}(C)=\mathrm{co}(C)$ , when $C$ is deterministic

(b) $\mathbb{E}(C\oplus D)=\mathbb{E}(C)\oplus\mathbb{E}(D)$

(c) $\mathbb{E}(\Psi C)=\mathbb{E}(\Psi)\cdot\mathbb{E}(C)$ , when $\Psi$ is independent of $C$

(d) $\mathbb{E}(C)\subseteq\mathbb{E}(D)$ , when $C\subseteq D$ a.s.

(e) $\mathbb{E}(C)\cup\mathbb{E}(D)\subseteq\mathbb{E}(C\cup D)$

(f) $\mathbb{E}(C\cap D)\subseteq\mathbb{E}(C)\cap\mathbb{E}(D)$

(g) $\mathbb{E}(C\ominus D)\subseteq\mathbb{E}(C)\ominus\mathbb{E}(D)$ , when $C\ominus D$ is a.s. non-empty.

Proof

The first result holds since $\mathbb{E}(X)=\mathbb{E}(\mathrm{co}(X))$ and $h(u,\mathbb{E}(C))=\mathbb{E}(h(u,C))=h(u,C)$ . The next result follows from $h(u,C\oplus D)=h(u,C)+h(u,D)$ schneider1993 , since $h(u,\mathbb{E}(C\oplus D))=\mathbb{E}(h(u,C\oplus D))=\mathbb{E}(h(u,C)+h(u,D))=\mathbb{E}(h(u,C))+\mathbb{E}(h(u,D))=h(u,\mathbb{E}(C))+h(u,\mathbb{E}(D))=h(u,\mathbb{E}(C)\oplus\mathbb{E}(D))$ . The fourth result holds since $h(u,C)\leq h(u,D)$ when $C\subseteq D$ schneider1993 , which implies $h(u,\mathbb{E}(C))=\mathbb{E}(h(u,C))\leq\mathbb{E}(h(u,D))=h(u,\mathbb{E}(D))$ . For the fifth result, note $C\subseteq C\cup D$ and $D\subseteq C\cup D$ . The fourth result gives $\mathbb{E}(C)\subseteq\mathbb{E}(C\cup D)$ and $\mathbb{E}(D)\subseteq\mathbb{E}(C\cup D)$ , which implies $\mathbb{E}(C)\cup\mathbb{E}(D)\subseteq\mathbb{E}(C\cup D)$ . The sixth result follows since combining $C\cap D\subseteq C$ , $C\cap D\subseteq D$ , and the fourth result gives: $\mathbb{E}(C\cap D)\subseteq\mathbb{E}(C)$ and $\mathbb{E}(C\cap D)\subseteq\mathbb{E}(D)$ , which implies $\mathbb{E}(C\cap D)\subseteq\mathbb{E}(C)\cap\mathbb{E}(D)$ . To prove the seventh result, note $(C\ominus D)\oplus D\subseteq C$ schneider1993 . Applying the second and fourth results yields $\mathbb{E}(C\ominus D)\oplus\mathbb{E}(D)\subseteq\mathbb{E}(C)$ , and so $\mathbb{E}(C\ominus D)\subseteq\mathbb{E}(C)\ominus\mathbb{E}(D)$ schneider1993 .

The third result cannot be proved using support functions since $h(x,\Psi C)$ cannot be written in terms of $h(x,C)$ . (If $\Psi=-1$ , then $h(x,\Psi C)=-\inf_{y\in C}x^{\textsf{T}}y$ while $h(x,C)=\sup_{y\in C}x^{\textsf{T}}y$ .) Our approach is to show $\mathcal{S}^{1}(\Psi C)=\Psi\mathcal{S}^{1}(C)$ , since this implies $\mathbb{E}(\Psi C)=\{\mathbb{E}(\Psi\xi):\xi\in\mathcal{S}^{1}(C)\}=\{\mathbb{E}(\Psi)\cdot\mathbb{E}(\xi):\xi\in\mathcal{S}^{1}(C)\}=\mathbb{E}(\Psi)\cdot\{\mathbb{E}(\xi):\xi\in\mathcal{S}^{1}(C)\}=\mathbb{E}(\Psi)\cdot\mathbb{E}(C)$ . The inclusion $\mathcal{S}^{1}(\Psi C)\supseteq\Psi\mathcal{S}^{1}(C)$ is obvious by definition. To prove the reverse inclusion, let $\{\xi_{n},n\geq 1\}$ with $\xi_{n}\in\mathcal{S}^{1}(C)$ be the Castaing representation castaing1967multi ; molchanov2006 ; rockafellar2009 of $C$ . Then $\{\Psi\xi_{n},n\geq 1\}$ is the Castaing representation of $\Psi C$ . But by Lemma 1.3 of molchanov2006 , each selection in $\mathcal{S}^{1}(\Psi C)$ can be approximated arbitrarily well by step functions with arguments from $\{\Psi\xi_{n},n\geq 1\}$ . Thus $\mathcal{S}^{1}(\Psi C)\subseteq\Psi\mathcal{S}^{1}(C)$ , and so $\mathcal{S}^{1}(\Psi C)=\Psi\mathcal{S}^{1}(C)$ since both inclusions were shown. ∎
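The support function characterization makes the Minkowski sum, monotonicity, and Minkowski difference results easy to check for random intervals, since $h(1,[a,b])=b$ and $h(-1,[a,b])=-a$ give $\mathbb{E}([a,b])=[\mathbb{E}(a),\mathbb{E}(b)]$ . The following Python sketch illustrates this under two stated assumptions: a two-outcome distribution is used as a stand-in (for convex-valued sets the weighted Minkowski sum of the outcomes gives the same interval), and the names `expect`, `mink_sum`, `mink_diff`, and `outcomes` are hypothetical.

```python
from fractions import Fraction as F

def expect(dist):
    """Expectation of a random interval; dist is a list of (prob, (lo, hi))."""
    return (sum(p * s[0] for p, s in dist), sum(p * s[1] for p, s in dist))

def mink_sum(a, b):
    """Minkowski sum of two intervals."""
    return (a[0] + b[0], a[1] + b[1])

def mink_diff(a, b):
    """Minkowski difference {x : x + b is contained in a}; None if empty."""
    lo, hi = a[0] - b[0], a[1] - b[1]
    return (lo, hi) if lo <= hi else None

# Joint outcomes: (probability, realization of C, realization of D)
outcomes = [
    (F(1, 2), (F(0), F(4)), (F(0), F(1))),
    (F(1, 2), (F(2), F(3)), (F(-1), F(0))),
]

E_C = expect([(p, c) for p, c, d in outcomes])
E_D = expect([(p, d) for p, c, d in outcomes])
E_sum = expect([(p, mink_sum(c, d)) for p, c, d in outcomes])
E_diff = expect([(p, mink_diff(c, d)) for p, c, d in outcomes])

print(E_sum, mink_sum(E_C, E_D))     # the two intervals coincide
print(E_diff, mink_diff(E_C, E_D))   # the first is contained in the second
```

Exact rational arithmetic is used so that the set identities can be compared without rounding error.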

Remark 3

Note the assumptions for part (c) include the cases where: $\Psi$ is deterministic, $C$ is deterministic, or $\Psi$ has positive or negative entries.

Another result used in statistics is Jensen’s inequality bickel2006 , which bounds changing the order of applying an expectation and a convex function to a random variable. Our next result shows we can generalize Jensen’s inequality to set-valued functions.

Proposition 2 (Jensen’s Inequality)

Let $S(u)$ be a graph-convex set-valued function (i.e., $S((1-\lambda)u_{0}+\lambda u_{1})\supseteq(1-\lambda)\cdot S(u_{0})+\lambda\cdot S(u_{1})\text{ for }\lambda\in(0,1)$ ), and let $X$ be a bounded and closed integrable random set. If $S(\cdot)$ is locally bounded (i.e., $S(B)$ is bounded for every bounded set $B$ ) and continuous, then we have $S(\mathbb{E}(X))\supseteq\mathbb{E}(S(X))$ .

Proof

The selection expectation equals the Debreu expectation under our assumptions molchanov2006 . This means there exists a sequence of random sets $X_{n}$ with the distribution

$$\mathbb{P}(X_{n}=F_{in})=p_{in},\quad\text{for }i=1,\ldots,n,$$

such that $\operatorname*{as-lim}_{n}X_{n}=X$ , $\mathbb{E}(X)=\lim_{n}\mathbb{E}(X_{n})$ , and $\mathbb{E}(X_{n})=\bigoplus_{i=1}^{n}p_{in}\cdot F_{in}$ . The semicontinuous mapping theorem implies $\operatorname*{as-lim}_{n}S(X_{n})=S(X)$ , and so we again have equality of the selection expectation and Debreu expectation molchanov2006 . This means that $\mathbb{E}(S(X))=\lim_{n}\mathbb{E}(S(X_{n}))$ and $\mathbb{E}(S(X_{n}))=\bigoplus_{i=1}^{n}p_{in}\cdot S(F_{in})$ . Next note $S(\bigoplus_{i=1}^{n}p_{in}\cdot F_{in})\supseteq\bigoplus_{i=1}^{n}p_{in}\cdot S(F_{in})$ by the graph-convexity of $S(\cdot)$ . Taking the limit of this set relationship gives $S(\mathbb{E}(X))=\lim_{n}S(\bigoplus_{i=1}^{n}p_{in}\cdot F_{in})\supseteq\lim_{n}\bigoplus_{i=1}^{n}p_{in}\cdot S(F_{in})=\mathbb{E}(S(X))$ , where we have used the fact that $\lim_{n}S(\bigoplus_{i=1}^{n}p_{in}\cdot F_{in})=S(\mathbb{E}(X))$ by definition of the continuity of the set-valued function $S(\cdot)$ . ∎

Remark 4

Jensen’s inequality is sometimes stated for concave functions, and such a generalization exists for set-valued mappings. If $S(u)$ is a graph-concave set-valued function (i.e., $S((1-\lambda)u_{0}+\lambda u_{1})\subseteq(1-\lambda)\cdot S(u_{0})+\lambda\cdot S(u_{1})\text{ for }\lambda\in(0,1)$ ) and the other assumptions of the above proposition hold, then we have $S(\mathbb{E}(X))\subseteq\mathbb{E}(S(X))$ .

Lastly, we present a strong law of large numbers (SLLN) for the selection expectation. The key idea is the Minkowski sum takes the role of averaging.

Theorem 3.2 (Artstein and Vitale, 1975 artstein1975 )

Suppose the probability space is nonatomic. If $X,X_{i}$ , $i\geq 1$ , are i.i.d. bounded and closed integrable random sets, then we have that: $\operatorname*{as-lim}_{n}\frac{1}{n}\bigoplus_{i=1}^{n}X_{i}=\mathbb{E}(X)$ .

This particular strong law of large numbers can be generalized in a number of ways, and a survey of the different generalizations possible can be found in molchanov2006 .
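A simple deterministic case shows how Minkowski averaging convexifies. Take $X=\{0,1\}$ , so that $\mathbb{E}(X)=\mathrm{co}(X)=[0,1]$ by Proposition 1. The sketch below (hypothetical helper names, exact rational arithmetic) computes $\frac{1}{n}\bigoplus_{i=1}^{n}X$ and its Hausdorff distance to $[0,1]$ .

```python
from fractions import Fraction as F

def minkowski_average(points, n):
    """(1/n) * (X + ... + X), n copies, for a finite set X of reals."""
    sums = {F(0)}
    for _ in range(n):
        sums = {s + F(x) for s in sums for x in points}
    return sorted(s / n for s in sums)

def hausdorff_to_interval(finite_set, lo, hi):
    """Hausdorff distance between a sorted finite subset of [lo, hi] and [lo, hi]."""
    gaps = [b - a for a, b in zip(finite_set, finite_set[1:])]
    # Farthest interval point from the finite set: an end gap or half an interior gap
    return max([finite_set[0] - lo, hi - finite_set[-1]] + [g / 2 for g in gaps])

X = [0, 1]
dists = {n: hausdorff_to_interval(minkowski_average(X, n), F(0), F(1))
         for n in (1, 10, 100)}
print(dists)
```

In this example the Minkowski average is the grid $\{0,1/n,\ldots,1\}$ , so the distance to $[0,1]$ is $1/(2n)$ and vanishes as $n$ grows.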

3.3 Central Limit Theorems

Unlike laws of large numbers that relate convergence of Minkowski sums of i.i.d. random sets $\frac{1}{n}\bigoplus_{i=1}^{n}X_{i}$ to their selection expectation $\mathbb{E}(X)$ , analogs of the central limit theorem (CLT) relating Minkowski sums and selection expectations are less well-understood. One major impediment is that the $\ominus$ operator does not generally invert the $\oplus$ operator, which means it is generally not possible to normalize (in the sense of having a zero mean) the Minkowski sum $\frac{1}{n}\bigoplus_{i=1}^{n}X_{i}$ . As a result, the standard approach to generalizing the central limit theorem is to normalize by instead considering the Hausdorff distance between the Minkowski sum and the selection expectation.

Theorem 3.3 (Weil, 1982 weil1982 )

Suppose the probability space is nonatomic. If $X,X_{i}$ , $i\geq 1$ , are i.i.d. bounded and closed integrable random sets, then we have that: $\sqrt{n}\cdot{\ooalign{$ d $\cr$ \mkern 6.8mul $}}_{\infty}(\frac{1}{n}\bigoplus_{i=1}^{n}X_{i},\mathbb{E}(X))\rightarrow\sup_{u}\{\|\zeta(u)\|\ |\ \|u\|\leq 1\}$ in distribution, where $\zeta(u)$ for $\|u\|\leq 1$ is a centered Gaussian random field with covariance given by: $\mathbb{E}(\zeta(u)\cdot\zeta(v))=\mathbb{E}(h(u,X)\cdot h(v,X))-\mathbb{E}(h(u,X))\cdot\mathbb{E}(h(v,X))$ .

The difficulty with this central limit theorem is that it lacks a clear geometrical interpretation (in contrast to the classical central limit theorem for random variables) for the limiting distribution, and the question of whether such a geometrical interpretation exists remains open molchanov2006 . However, one advantage of this formulation is that it lends itself to a generalization of the Delta method bickel2006 from statistics.

Proposition 3 (Approximate Delta Method)

Suppose $r_{n}{\ooalign{$ d $\cr$ \mkern 6.8mul $}}_{\infty}(C_{n},C)\rightarrow w$ in distribution, where $r_{n}$ is a strictly increasing sequence, $C_{n}\in\mathcal{K}$ is a sequence of random sets, $C\in\mathcal{K}$ is a deterministic set, and $w$ is a random variable. If $S$ is a Lipschitz continuous set-valued set function, then

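$$\limsup_{n}\mathbb{P}\big(r_{n}{\ooalign{$ d $\cr$ \mkern 6.8mul $}}_{\infty}(S(C_{n}),S(C))\geq u\big)\leq\mathbb{P}(\kappa\cdot w\geq u),$$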

where $\kappa\in\mathbb{R}_{+}$ is the Lipschitz constant of $S$ .

Proof

Lipschitz continuity of $S$ gives $r_{n}{\ooalign{$ d $\cr$ \mkern 6.8mul $}}_{\infty}(S(C_{n}),S(C))\leq\kappa\cdot r_{n}{\ooalign{$ d $\cr$ \mkern 6.8mul $}}_{\infty}(C_{n},C)$ . Thus $\mathbb{P}(r_{n}{\ooalign{$ d $\cr$ \mkern 6.8mul $}}_{\infty}(S(C_{n}),S(C))\geq u)\leq\mathbb{P}(\kappa\cdot r_{n}{\ooalign{$ d $\cr$ \mkern 6.8mul $}}_{\infty}(C_{n},C)\geq u)$ . The limit superior of both sides gives the result since $r_{n}{\ooalign{$ d $\cr$ \mkern 6.8mul $}}_{\infty}(C_{n},C)\rightarrow w$ in distribution. ∎

Remark 5

The Delta method relates asymptotic distributions of random variables under differentiable functions bickel2006 , and the intuition is the derivative is used as a local approximation of the function. The above result demonstrates one instance where Lipschitz continuity can be used as a local approximation for set-valued mappings.

Though $\ominus$ does not generally invert $\oplus$ , there is one special case when inversion is possible. If $C,D$ are compact convex sets, then $(C\oplus D)\ominus D=C$ schneider1993 . Using this property, we describe a new central limit theorem for random sets with a particular structure that is useful for statistical applications. Specifically, this result applies to randomly translated sets (RaTS), which are random sets of the form $C=K\oplus\xi$ , where $K$ is a deterministic compact convex set, and $\xi$ is a (vector-valued) random variable.

Theorem 3.4 (Central Limit Theorem for RaTS)

Suppose the probability space is nonatomic, and that $X,X_{i}$ , $i\geq 1$ , are i.i.d. random sets with $X_{i}=K\oplus\xi_{i}$ , where $K$ is a deterministic compact convex set and $\xi,\xi_{i}$ , $i\geq 1$ , are i.i.d. (vector-valued) random variables with zero mean and finite variance. Then

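$$\sqrt{n}\cdot\Big(\big(\textstyle\frac{1}{n}\bigoplus_{i=1}^{n}X_{i}\big)\ominus\mathbb{E}(X)\Big)\rightarrow\mathcal{N}(0,\mathbb{E}(\xi\xi^{\mathsf{T}}))$$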

in distribution, where $\mathcal{N}(0,\mathbb{E}(\xi\xi^{\mathsf{T}}))$ is a jointly Gaussian random variable with zero mean and covariance matrix given by $\mathbb{E}(\xi\xi^{\mathsf{T}})$ .

Proof

Since $(\frac{1}{n}\bigoplus_{i=1}^{n}X_{i})\ominus\mathbb{E}(X)=((\frac{1}{n}\sum_{i=1}^{n}\xi_{i})\oplus K)\ominus K=\frac{1}{n}\sum_{i=1}^{n}\xi_{i}$ , the result follows by the classical central limit theorem bickel2006 .∎

The benefit of this new formulation of the central limit theorem is that it has a clear geometrical interpretation like the classical central limit theorem for random variables, but unfortunately this result only applies to the specific class of RaTS.
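The geometric interpretation can be illustrated by simulation in one dimension. The following Python sketch assumes $K=[-1,1]$ and $\xi_{i}$ uniform on $(-1,1)$ , both of which are hypothetical choices made only for this illustration. It checks that the Minkowski difference of the averaged sets with $\mathbb{E}(X)=K$ is a single point whose scaled variance is close to $\mathbb{E}(\xi^{2})=1/3$ .

```python
import random
import statistics

random.seed(0)
K = (-1.0, 1.0)   # deterministic compact convex set, stored as an interval

def normalized_difference(n):
    """sqrt(n) * (Minkowski average of X_i, minus K) for X_i = K + xi_i."""
    xis = [random.uniform(-1.0, 1.0) for _ in range(n)]
    lo = sum(K[0] + x for x in xis) / n    # left endpoint of the Minkowski average
    hi = sum(K[1] + x for x in xis) / n    # right endpoint of the Minkowski average
    d_lo, d_hi = lo - K[0], hi - K[1]      # Minkowski difference with E(X) = K
    assert abs(d_lo - d_hi) < 1e-9         # the difference collapses to one point
    return n ** 0.5 * d_lo

draws = [normalized_difference(200) for _ in range(2000)]
sample_var = statistics.pvariance(draws)
print(sample_var)   # should be near 1/3, the variance of U(-1, 1)
```

Because the averaged set is a translate of $K$ , the Minkowski difference recovers the sample mean of the translations, which is why the classical central limit theorem applies.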

4 Kernel Regression

We will construct a nonparametric estimator for set-valued functions using an approach that can be viewed as a natural generalization of kernel regression methods for functions aswani2011 ; aswani2013 ; bickel1982 ; noda1976 ; wand1994 . These techniques are considered nonparametric because, in contrast to parametric models with a finite number of parameters, the number of parameters in nonparametric models increases as the amount of data increases.

4.1 Problem Setup

Consider a Lipschitz continuous set-valued function $S(u):U\rightrightarrows\mathbb{R}^{q}$ with random samples $(X_{i},S_{i})\in U\times\mathbb{R}^{q}$ for $i=1,\ldots,n$ , where: $U\subseteq\mathbb{R}^{d}$ is a convex compact set; $S(u)$ is a convex compact set for each $u\in U$ ; $X_{i}$ are i.i.d. (vector-valued) random variables with a Lipschitz continuous density function $f_{X}$ that has the property $f_{X}(u)>0$ for $u\in U$ ; and $S_{i}=S(X_{i})\oplus W_{i}$ with $W_{i}$ i.i.d. (vector-valued) random variables that have zero mean $\mathbb{E}(W)=0$ and finite variance $\|\mathbb{E}(WW^{\mathsf{T}})\|<+\infty$ . The problem is to estimate $S(u)$ at any $u\in U$ using the above described samples. We need convexity of $U$ to ensure its tangent cone is derivable at $\partial U$ rockafellar2009 ; however, our results hold for all $u\in\mathrm{int}(U)$ regardless of any such regularity assumptions.

4.2 Kernel Functions

Kernel regression is so named because these approaches use kernel functions $\varphi:\mathbb{R}\rightarrow\mathbb{R}$ , which are functions that are non-negative, bounded, even (i.e., $\varphi(-u)=\varphi(u)$ ), and have finite support (i.e., there is a constant $\eta\in(0,1)$ such that $\varphi(u)>0$ when $|u|\leq\eta$ , and $\varphi(u)=0$ for $|u|\geq 1$ ). One example of a kernel function is the indicator function $\varphi(u)=(1/2)\cdot\mathbf{1}(|u|<1)$ , and another example is the Epanechnikov kernel $\varphi(u)=(3/4)\cdot(1-u^{2})\cdot\mathbf{1}(|u|\leq 1)$ . Notationally, it is useful to define the family of kernel functions $\varphi_{h}(u)=h^{-d}\varphi(\|u\|/h)$ and the function $\gamma(u)=\int_{z\in T_{U}(u)}\varphi_{1}(z)dz$ , where $T_{U}(u)$ is the tangent cone of $U$ at the point $u$ . (Note $\gamma(u)$ is strictly greater than zero and finite because of the assumptions.) We first prove a lemma about $\varphi_{h}(u)$ :

Lemma 2

If $h=n^{-1/(d+4)}$ , then for $u\in U$ we have

(a) $\operatorname*{as-lim}_{n}\frac{1}{n}\sum_{i=1}^{n}\varphi_{h}(X_{i}-u)=\gamma(u)\cdot f_{X}(u)$

(b) $\operatorname*{as-lim}_{n}\frac{1}{n}\sum_{i=1}^{n}\varphi_{h}(X_{i}-u)\cdot w_{i}=0$

(c) $\operatorname*{as-lim}_{n}\frac{1}{n}\sum_{i=1}^{n}\varphi_{h}(X_{i}-u)\cdot\|X_{i}-u\|=0$

Proof

We prove these three results by verifying the hypothesis of Kolmogorov’s strong law of large numbers holds in each case, then applying this law of large numbers, and finally computing the expectation of the corresponding quantity in each case. To prove the first result, observe that

[TABLE]

where the first inequality holds for some constant $c\in\mathbb{R}^{+}$ because $\varphi_{h}(X_{i}-u)$ is bounded and nonzero with probability at most $s\cdot h^{d}$ for some constant $s\in\mathbb{R}^{+}$ ; and the second inequality holds because $nh^{d}=n^{4/(d+4)}$ . The finiteness of the above summation means we can apply Kolmogorov’s strong law of large numbers, which gives $\operatorname*{as-lim}_{n}\frac{1}{n}\sum_{i=1}^{n}\varphi_{h}(X_{i}-u)-\mathbb{E}(\frac{1}{n}\sum_{i=1}^{n}\varphi_{h}(X_{i}-u))=0$ . Our next step is to compute this expectation. Note that

[TABLE]

where in the last line we made the change of variables $z=(x-u)/h$ . Let $R(h)=(\mathbb{B}\cap(U\oplus\{-u\})/h)\setminus T_{U}(u)$ and $S(h)=(\mathbb{B}\cap T_{U}(u))\setminus(U\oplus\{-u\})/h$ . So we have

[TABLE]

where $\kappa\in\mathbb{R}_{+}$ is the Lipschitz constant of the density $f_{X}(u)$ , and $s\in\mathbb{R}_{+}$ is a constant that exists by continuity of $f_{X}(u)$ . Next note $s\int_{z\in R(h)}dz+s\int_{z\in S(h)}dz\rightarrow 0$ as $h\rightarrow 0$ by Proposition 6.2 and Theorem 4.10 of rockafellar2009 . Thus taking the limit of (14) gives $\lim_{n}\mathbb{E}(\frac{1}{n}\sum_{i=1}^{n}\varphi_{h}(X_{i}-u))=f_{X}(u)\cdot\int_{z\in\mathbb{B}\cap T_{U}(u)}\varphi_{1}(z)dz$ . This proves the first result when combined with the implication of Kolmogorov’s strong law of large numbers in our setting, and after noting $\gamma(u)=\int_{z\in\mathbb{B}\cap T_{U}(u)}\varphi_{1}(z)dz$ since $\varphi_{1}(u)=0$ for $\|u\|>1$ .

For the proof of the second result, let $\langle w\rangle_{j}$ denote the $j$ -th component of the vector $w$ . Next observe that

[TABLE]

where the first inequality holds for some constant $c\in\mathbb{R}^{+}$ because the $w_{i}$ have zero mean and because $\varphi_{h}(X_{i}-u)$ is bounded and nonzero with probability at most $s\cdot h^{d}$ for some constant $s\in\mathbb{R}^{+}$ ; and the second inequality holds because $nh^{d}=n^{4/(d+4)}$ . The finiteness of the above summation means Kolmogorov’s strong law of large numbers gives $\operatorname*{as-lim}_{n}\frac{1}{n}\sum_{i=1}^{n}\varphi_{h}(X_{i}-u)\cdot\langle w_{i}\rangle_{j}-\mathbb{E}(\frac{1}{n}\sum_{i=1}^{n}\varphi_{h}(X_{i}-u)\cdot\langle w_{i}\rangle_{j})=0$ . But the $w_{i}$ are zero mean, and so we have that $\mathbb{E}(\frac{1}{n}\sum_{i=1}^{n}\varphi_{h}(X_{i}-u)\cdot\langle w_{i}\rangle_{j})=0$ .

To prove the third result, observe that

[TABLE]

where the first inequality holds for some constant $c\in\mathbb{R}^{+}$ because $U$ is a compact set and because $\varphi_{h}(X_{i}-u)$ is bounded and nonzero with probability at most $s\cdot h^{d}$ for some constant $s\in\mathbb{R}^{+}$ ; and the second inequality holds because $nh^{d}=n^{4/(d+4)}$ . The finiteness of the above summation means we can apply Kolmogorov’s strong law of large numbers, which gives $\operatorname*{as-lim}_{n}\frac{1}{n}\sum_{i=1}^{n}\varphi_{h}(X_{i}-u)\cdot\|X_{i}-u\|-\mathbb{E}(\frac{1}{n}\sum_{i=1}^{n}\varphi_{h}(X_{i}-u)\cdot\|X_{i}-u\|)=0$ . Our next step is to compute this expectation. Note that

[TABLE]

where the second line makes the change of variables $z=(x-u)/h$ , and the third line holds for some constant $c\in\mathbb{R}^{+}$ because the kernel has finite support and the density is continuous. The above expectation is non-negative, and so $\lim_{n}\mathbb{E}(\frac{1}{n}\sum_{i=1}^{n}\varphi_{h}(X_{i}-u)\cdot\|X_{i}-u\|)=0$ . This proves the third result when combined with the outcome of Kolmogorov’s strong law of large numbers. ∎
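The role of $\gamma(u)$ in the first result of the lemma can be seen in a small simulation. The sketch below is illustrative and rests on assumptions not taken from the text: $d=1$ , $U=[-2,2]$ , $X_{i}$ uniform on $U$ (so $f_{X}=1/4$ ), and the Epanechnikov kernel. At an interior point the tangent cone is all of $\mathbb{R}$ and $\gamma=1$ ; at the boundary point $u=2$ the tangent cone is a half-line and $\gamma=1/2$ .

```python
import random

random.seed(1)

def epanechnikov(t):
    """Epanechnikov kernel on the real line."""
    return 0.75 * (1.0 - t * t) if abs(t) <= 1.0 else 0.0

def phi_h(v, h, d=1):
    """Scaled kernel phi_h(v) = h^(-d) * phi(|v| / h) for scalar v (d = 1)."""
    return epanechnikov(abs(v) / h) / h ** d

n = 20000
h = n ** (-1.0 / 5.0)   # bandwidth h = n^(-1/(d+4)) with d = 1
xs = [random.uniform(-2.0, 2.0) for _ in range(n)]

def kernel_average(u):
    """(1/n) * sum_i phi_h(X_i - u)."""
    return sum(phi_h(x - u, h) for x in xs) / n

interior = kernel_average(0.0)   # limit is gamma * f_X = 1 * 1/4
boundary = kernel_average(2.0)   # limit is gamma * f_X = 1/2 * 1/4
print(interior, boundary)
```

The halving at the boundary is the reason the estimator normalizes by the kernel average rather than by the density alone.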

4.3 Kernel Regression Estimator

We define a kernel regression estimate of $S$ at the point $u$ to be

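$$\widehat{S}(u)=\Big(\textstyle\frac{1}{n}\sum_{i=1}^{n}\varphi_{h}(X_{i}-u)\Big)^{-1}\cdot\frac{1}{n}\bigoplus_{i=1}^{n}\varphi_{h}(X_{i}-u)\cdot S_{i}.\tag{18}$$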

The following theorem proves the strong pointwise consistency of this estimator.

Theorem 4.1

If $h=n^{-1/(d+4)}$ , then $\operatorname*{as-lim}_{n}\widehat{S}(u)=S(u)$ for $u\in U$ .

Proof

Let $\kappa\in\mathbb{R}_{+}$ be the Lipschitz constant of $S$ , and note that by Lipschitz continuity we have

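$$\textstyle\frac{1}{n}\bigoplus_{i=1}^{n}\varphi_{h}(X_{i}-u)\cdot\big(S(u)\oplus w_{i}\big)\subseteq\frac{1}{n}\bigoplus_{i=1}^{n}\varphi_{h}(X_{i}-u)\cdot\big(S_{i}\oplus\kappa\|X_{i}-u\|\mathbb{B}\big)\subseteq\frac{1}{n}\bigoplus_{i=1}^{n}\varphi_{h}(X_{i}-u)\cdot\big(S(u)\oplus 2\kappa\|X_{i}-u\|\mathbb{B}\oplus w_{i}\big).\tag{19}$$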

Corollary 2(c) and Lemma 2(a) give $\operatorname*{as-lim}_{n}\frac{1}{n}\bigoplus_{i=1}^{n}\varphi_{h}(X_{i}-u)\cdot S(u)=\gamma(u)\cdot f_{X}(u)\cdot S(u)$ , and Corollary 2(a) and Lemma 2(b) yield $\operatorname*{as-lim}_{n}\frac{1}{n}\bigoplus_{i=1}^{n}\varphi_{h}(X_{i}-u)\cdot(S(u)\oplus w_{i})=\gamma(u)\cdot f_{X}(u)\cdot S(u)$ . Corollary 2(c) and Lemma 2(c) give $\operatorname*{as-lim}_{n}\frac{1}{n}\bigoplus_{i=1}^{n}\varphi_{h}(X_{i}-u)\cdot\kappa\|X_{i}-u\|\mathbb{B}=\{0\}$ , and so Corollary 2(a) implies $\operatorname*{as-lim}_{n}\frac{1}{n}\bigoplus_{i=1}^{n}\varphi_{h}(X_{i}-u)\cdot\big{(}S(u)\oplus 2\kappa\|X_{i}-u\|\mathbb{B}\oplus w_{i}\big{)}=\gamma(u)\cdot f_{X}(u)\cdot S(u)$ . So applying the sandwich lemma to (19) yields $\operatorname*{as-lim}_{n}\textstyle\frac{1}{n}\bigoplus_{i=1}^{n}\varphi_{h}(X_{i}-u)\cdot\big{(}S_{i}\oplus\kappa\|X_{i}-u\|\mathbb{B}\big{)}=\gamma(u)\cdot f_{X}(u)\cdot S(u)$ . Corollary 3 gives $\operatorname*{as-lim}_{n}\frac{1}{n}\bigoplus_{i=1}^{n}\varphi_{h}(X_{i}-u)\cdot S_{i}=\gamma(u)\cdot f_{X}(u)\cdot S(u)$ . Finally, Corollary 2(d) and Lemma 2(a) imply that $\operatorname*{as-lim}_{n}\widehat{S}(u)=S(u)$ . ∎

4.4 Algorithms to Compute Kernel Regression Estimator

The statistical consistency of our kernel regression estimator is a theoretical result, and numerical computation of this estimator using the measured data $(u_{i},s_{i})$ for $i=1,\ldots,n$ needs some discussion. The key point is that the corresponding algorithm used to compute the estimator depends on the representation of the sets $s_{i}$ . Since the random sets $S_{i}$ are RaTS, we only need to consider different representations of convex sets. Moreover, we focus our discussion on polytope representations since any compact convex set can be approximated arbitrarily well by polytopes schneider1993 .

If the sets $s_{i}$ are each represented by polynomial time membership oracles, then

[TABLE]

and so membership in the Minkowski sum can be determined in polynomial time. Polynomial time membership oracles exist for $s_{i}$ in a known compact set $G$ , with a self-concordant barrier function for $G$ and the functions defining $s_{i}$ nesterov1994 : The measurement of $s_{i}$ would consist of the function parameters defining $s_{i}$ , and set membership is determined by using an interior point method to solve a feasibility problem. Examples include polytopes $s_{i}=\{t_{i}:a_{i}t_{i}\leq b_{i}\}$ , with measured data $a_{i},b_{i}$ ; second-order cone sets $s_{i}=\{t_{i}:\|a_{i,j}t_{i}+b_{i,j}\|_{2}\leq c_{i,j}^{\textsf{T}}t_{i}+d_{i,j}\text{ for }j=1,\ldots,k\}$ , with measured data $a_{i,j},b_{i,j},c_{i,j},d_{i,j}$ ; and combinations thereof. Other examples can be found in nesterov1994 .

Next suppose each set $s_{i}$ is represented as a zonotope, which is a polytope defined as a Minkowski sum of vectors: $s_{i}=\bigoplus_{k=1}^{p}w_{ik}\cdot z_{k}$ , where $w_{ik}$ are weights and $z_{k}$ are vectors. Restated, the observations are the $w_{ik}$ and $z_{k}$ . Then

[TABLE]

and so the Minkowski sum is polynomial time computable for this representation.
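The computational saving for zonotopes comes from aggregating weights on shared generators. The Python sketch below checks this through support functions, under assumptions that go beyond the text: each $z_{k}$ is read as a symmetric segment $[-g_{k},g_{k}]$ , all weights are non-negative, and the numbers in `W` and `c` (standing in for the $w_{ik}$ and the kernel weights) are made up for illustration.

```python
def support_zonotope(u, weights, generators):
    """Support function at u of the Minkowski sum of weights[k] * [-g_k, g_k]."""
    return sum(w * abs(sum(ui * gi for ui, gi in zip(u, g)))
               for w, g in zip(weights, generators))

generators = [(1, 0), (0, 1), (1, 1)]    # shared segments z_k = [-g_k, g_k]
W = [[1, 2, 0], [3, 0, 1], [0, 1, 1]]    # W[i][k] = w_ik for three observed sets
c = [2, 1, 3]                            # hypothetical non-negative kernel weights

# Aggregate the weights per generator: O(n * p) arithmetic operations
agg = [sum(c[i] * W[i][k] for i in range(len(W))) for k in range(len(generators))]

directions = [(1, 0), (0, 1), (1, -1), (2, 3)]
# Support function of the weighted Minkowski sum, computed set by set
lhs = [sum(c[i] * support_zonotope(u, W[i], generators) for i in range(len(W)))
       for u in directions]
# Support function of the single zonotope with aggregated weights
rhs = [support_zonotope(u, agg, generators) for u in directions]
print(agg, lhs == rhs)
```

Since compact convex sets are determined by their support functions, agreement in every direction means the weighted Minkowski sum is again a zonotope on the same generators with the aggregated weights.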

Lastly, suppose the sets $s_{i}$ are represented by the convex hull of a finite set of $p_{i}$ vertices, meaning that $s_{i}=\mathrm{co}(\{v_{i1},\ldots,v_{ip_{i}}\})$ . In this setting the measurements are the vertices of each set $s_{i}$ , and the Minkowski sum is given by

[TABLE]

This is a polynomial time computation since the number of vertices is finite.

4.5 Numerical Example

We conclude our discussion on kernel regression of set-valued functions with a numerical example to visually demonstrate the estimation problem being solved by our estimator. Consider the set-valued function in the bottom-left of Fig. 1, given by

[TABLE]

The $X_{i}$ variables have a $U(-2,2)$ distribution, and each measurement $s_{i}$ is in a vertex representation. The noise $W_{i}$ has a $U(-1,1)$ distribution, meaning its variance is $1/3$ . The top row of Fig. 1 shows measurements for $n=10^{2}$ , $n=10^{3}$ , and $n=10^{4}$ data points, respectively; and the bottom row shows estimates computed by (18) and (22) with an Epanechnikov kernel. (Our code, available at http://ieor.berkeley.edu/~aaswani/code/ssvf.zip , runs in a few seconds.) This example shows that as the amount of data increases, the estimates $\hat{S}(u)$ converge pointwise to the actual set-valued function $S(u)$ .
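For readers who want to experiment, a kernel regression estimate for interval-valued data can be sketched in a few lines of Python. This is not the authors' code, and the map `true_S` below is a hypothetical stand-in rather than the function used in Fig. 1; only the sampling distributions for $X_{i}$ and $W_{i}$ and the Epanechnikov kernel follow the description above. Intervals are the one-dimensional vertex representation, so a non-negatively weighted Minkowski sum reduces to weighted sums of endpoints.

```python
import random

random.seed(2)

def epanechnikov(t):
    return 0.75 * (1.0 - t * t) if abs(t) <= 1.0 else 0.0

def true_S(u):
    """Hypothetical interval-valued map (NOT the function from the paper's figure)."""
    return (u * u - 1.0, u * u + 1.0)

n = 20000
h = n ** (-1.0 / 5.0)                               # h = n^(-1/(d+4)) with d = 1
xs = [random.uniform(-2.0, 2.0) for _ in range(n)]  # X_i ~ U(-2, 2)
ws = [random.uniform(-1.0, 1.0) for _ in range(n)]  # W_i ~ U(-1, 1)
# Observed sets S_i = S(X_i) + W_i, stored by their two vertices
samples = [(true_S(x)[0] + w, true_S(x)[1] + w) for x, w in zip(xs, ws)]

def estimate(u):
    """Kernel-weighted Minkowski average of the intervals, divided by the weight sum."""
    k = [epanechnikov(abs(x - u) / h) / h for x in xs]
    total = sum(k)
    lo = sum(ki * s[0] for ki, s in zip(k, samples)) / total
    hi = sum(ki * s[1] for ki, s in zip(k, samples)) / total
    return lo, hi

lo, hi = estimate(0.5)
print(lo, hi, true_S(0.5))
```

Because the noise only translates each interval, the estimated width matches the true width up to rounding, while the location error shrinks as $n$ grows.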

5 Inverse Approximate Optimization

Inverse optimization involves computing parameters that make measured solutions optimal ahuja2001 ; aswani2015 ; bertsimas2015 ; chan2014 ; esfahani2015 ; keshavarz2011 . In contrast, the inverse approximate optimization problem makes noisy measurements of suboptimal solutions, and the goal is to estimate the amount of suboptimality and the parameters of the optimization problem generating the data. In principle, the VIA bertsimas2015 and KKT keshavarz2011 estimators can provide estimates of the desired quantities; but we show their estimates are statistically inconsistent. As a result, we construct an estimator for inverse approximate optimization, prove its statistical consistency, and then discuss some possible generalizations.

5.1 Problem Setup

Consider a parametric convex optimization problem

[TABLE]

in which $f,g$ are continuous functions that are convex in $x$ for each fixed value of $u$ and $\theta$ , and assume that the following constraint qualification holds: for all $u,\theta$ there exists $x$ such that $g(x,u,\theta)<0$ . (Note this constraint representation is fully general since we can write $g=\max g_{i}$ .) We use the definition that $\epsilon$ -optimal solutions are those in the set

[TABLE]

Our results also apply when $\epsilon$ -optimal solutions are defined as in (25) but with $g(x,u,\theta)\leq\epsilon$ . The difference is (25) does not allow any constraint violation, while the alternative definition allows $\epsilon$ constraint violation. Note there are other notions of $\epsilon$ -optimal solutions like distance to the KKT graph, but we do not consider these.
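As a small illustration of the two notions of $\epsilon$ -optimality just described, the sketch below checks membership for the one-dimensional problem used later in Proposition 4 ( $f=x^{2}$ , $g=[x-1;-x-1]$ , optimal value $0$ ). The function name `is_eps_optimal`, the `relax_constraints` flag, and the numerical tolerance are hypothetical choices made for this example.

```python
def f(x):
    """Objective f(x) = x^2."""
    return x * x


def g(x):
    """Constraints x - 1 <= 0 and -x - 1 <= 0 written as a single max."""
    return max(x - 1.0, -x - 1.0)


V = 0.0  # optimal value of min f(x) subject to g(x) <= 0, attained at x = 0


def is_eps_optimal(x, eps, relax_constraints=False, tol=1e-12):
    """Membership test for the eps-optimal solution set.

    Default: f(x) <= V + eps and g(x) <= 0 (no constraint violation allowed).
    With relax_constraints=True: g(x) <= eps is allowed instead.
    """
    bound = eps if relax_constraints else 0.0
    return g(x) <= bound + tol and f(x) <= V + eps + tol


inside = is_eps_optimal(0.5, 0.25)        # f(0.5) = 0.25 <= 0.25, feasible
outside = is_eps_optimal(0.6, 0.25)       # f(0.6) = 0.36 > 0.25
infeasible = is_eps_optimal(1.1, 5.0)     # violates x <= 1
relaxed = is_eps_optimal(1.1, 5.0, relax_constraints=True)
```

With $\epsilon=1$ the default definition gives the solution set $[-1,1]$ , which is the set appearing in the proof of Proposition 4.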

Now suppose $\epsilon$ -optimal solutions of (24) generate random samples $(U_{i},Y_{i})\in D\times\mathbb{R}^{p}$ for $i=1,\ldots,n$ , where: $U_{i}$ are i.i.d. (vector-valued) random variables distributed on the set $D\subseteq\mathbb{R}^{d}$ ; $Y_{i}=X_{i}+W_{i}$ , where $X_{i}$ are i.i.d. (vector-valued) random variables distributed on $S(U_{i},\epsilon_{0},\theta_{0})$ with constants $\epsilon_{0}\in\mathbb{R}_{+}$ and $\theta_{0}\in\mathbb{R}^{p}$ ; and $W_{i}$ are i.i.d. (vector-valued) random variables with zero mean $\mathbb{E}(W_{i})=0$ and distributed on a known convex set $W$ with finite support (which implies finite variance). We also assume the densities of $W_{i},X_{i}$ are strictly positive on the interior of their supports (i.e., $f_{W}(u)>0$ for $u\in\mathrm{int}(W)$ and $f_{X}(u|U_{i})>0$ for $u\in\mathrm{int}(S(U_{i},\epsilon_{0},\theta_{0}))$ ).

The inverse approximate optimization problem is to estimate $(\epsilon_{0},\theta_{0})$ using the $(U_{i},Y_{i})$ for $i=1,\ldots,n$ . Note that we assume the functional forms of $f,g$ are fixed. Let $E\subseteq\mathbb{R}_{+}$ be a known closed set such that $\epsilon_{0}\in E$ , and let $\Theta\subseteq\mathbb{R}^{p}$ be a known compact set such that $\theta_{0}\in\Theta$ ; the intuition is that these sets represent prior knowledge that constrains the parameters and amount of solution suboptimality. The choice $E=\mathbb{R}_{+}$ corresponds to a situation with no such prior knowledge on $\epsilon_{0}$ , and the compactness assumption on $\Theta$ is not restrictive in practice because this set can be made arbitrarily large. (Unbounded $\Theta$ can also be used when a compactification with certain technical properties exists bahadur1971some .) A so-called identifiability condition bickel2006 is also needed. We assume that if $(\epsilon_{0},\theta_{0})\in E\times\Theta$ then $\mathbb{E}(d^{2}(Y_{i},S(U_{i},\epsilon,\theta)\oplus W))>0$ for all $\epsilon\in[0,\epsilon_{0})$ and $\theta\in\Theta\setminus\{\theta_{0}\}$ . An identifiability condition (such as the one we have assumed) intuitively says that different parameters of the model produce different outputs.

5.2 Inconsistency of Existing Estimators

The VIA bertsimas2015 (which minimizes the first order suboptimality of the data) and KKT keshavarz2011 (which minimizes the KKT suboptimality of the data) estimators are statistically inconsistent for $\epsilon_{0}=0$ aswani2015 , but since these approaches minimize the amount of suboptimality of the measured data it is initially unclear without further analysis whether these approaches are inconsistent for problem instances with $\epsilon_{0}>0$ . The following result provides qualitative insights into the behavior of these estimators.

Proposition 4

Let $r\in\mathbb{R}_{+}$ be a constant, and suppose $f=x^{2}$ , $g=[x-1;-x-1]$ , $\epsilon_{0}=1$ , $W=\{w:\|w\|\leq r\}$ , and $W_{i},X_{i}$ are uniformly distributed. Then estimates $\hat{\epsilon}$ generated by the VIA bertsimas2015 and KKT keshavarz2011 methods are such that $\operatorname*{as-lim\,inf}_{n}\hat{\epsilon}>r/3$ .

Proof

The KKT estimate is given by

[TABLE]

where these $\Lambda_{i}$ are the minimizers of the below optimization problem

[TABLE]

Since $\operatorname*{\epsilon-arg}\min_{x}\{f(x,u,\theta)\ |\ g(x,u,\theta)\leq 0\}=[-1,1]$ under the hypothesis of this proposition, it holds that the $Y_{i}$ are i.i.d. and have a triangular distribution with lower limit $-r-1$ , upper limit $r+1$ , and mode $0$ . Hence the density of $C_{i}=\langle g(Y_{i})\rangle_{1}^{+}$ is given by

[TABLE]

where $\delta(u)$ is the Dirac delta function. So $\mathbb{E}(C_{i})=\frac{r+1}{3}$ , and the strong law of large numbers implies $\frac{r+1}{3}=\operatorname*{as-lim}_{n}\frac{1}{n}\sum_{i=1}^{n}C_{i}=\operatorname*{as-lim\,inf}_{n}\frac{1}{n}\sum_{i=1}^{n}C_{i}\leq\operatorname*{as-lim\,inf}_{n}\hat{\epsilon}$ .

The VIA estimate is given by $\hat{\epsilon}=\frac{1}{n}\sum_{i=1}^{n}|\hat{\epsilon}_{i}|$ where $\hat{\epsilon}_{i}$ are the minimizers of

[TABLE]

However, observe that $2Y_{i}(x_{i}-Y_{i})\geq-\hat{\epsilon}_{i}$ for $x_{i}\in[-1,1]$ simplifies to the constraint $-2(|Y_{i}|+Y_{i}^{2})\geq-\hat{\epsilon}_{i}$ . Since the above optimization is minimizing each $|\hat{\epsilon}_{i}|$ , this constraint holds with equality at optimality, meaning $\hat{\epsilon}_{i}=2(|Y_{i}|+Y_{i}^{2})$ . Recall that as shown in the proof for KKT, the $Y_{i}$ have a triangular distribution with lower limit $-r-1$ , upper limit $r+1$ , and mode $0$ . This means $\mathbb{E}(|\hat{\epsilon}_{i}|)=\frac{2r+2}{3}+\frac{(r+1)^{2}}{9}$ . Applying the strong law of large numbers gives $\frac{2r+2}{3}+\frac{(r+1)^{2}}{9}=\operatorname*{as-lim}_{n}\frac{1}{n}\sum_{i=1}^{n}|\hat{\epsilon}_{i}|=\operatorname*{as-lim}_{n}\hat{\epsilon}$ . ∎

This proposition shows that existing approaches cannot distinguish between noise in measurements versus suboptimality of the solutions. The reason is that these approaches are minimizing an incorrect error metric: They minimize the amount of suboptimality of the measured data, and this is an incorrect error metric when the measured data is noisy because the noise increases the suboptimality of the measured data. Moreover, this indistinguishability of existing approaches is unbounded in the sense that as the noise variance increases then their estimates of suboptimality increase in an unbounded way. Such behavior is undesirable, and in fact the above result gives the following corollary on the statistical properties of VIA and KKT.
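The mechanism described above (noise inflating the apparent suboptimality of the measured data) can be seen in a short Monte Carlo sketch for the model of Proposition 4. This is an illustration only: it computes the objective gap $f(Y_{i})-V=Y_{i}^{2}$ and the fraction of infeasible measurements, not the VIA or KKT estimates themselves, and the function name, sample size, and seed are arbitrary choices.

```python
import random


def measured_suboptimality(r, n=20000, seed=0):
    """Average objective gap of noisy measurements and fraction outside [-1, 1].

    Model: X uniform on the 1-optimal set [-1, 1] of f(x) = x^2,
    W uniform on [-r, r], and Y = X + W is what gets measured.
    """
    rng = random.Random(seed)
    total_gap = 0.0
    outside = 0
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        w = rng.uniform(-r, r)
        y = x + w
        total_gap += y * y          # f(y) - V with V = 0
        outside += abs(y) > 1.0     # measurement violates the constraints
    return total_gap / n, outside / n


gap_clean, frac_clean = measured_suboptimality(0.0)
gap_noisy, frac_noisy = measured_suboptimality(3.0)
```

Without noise every measurement is feasible and the average gap stays below $\epsilon_{0}=1$ ; with $r=3$ most measurements are infeasible and the average gap exceeds $\epsilon_{0}$ even though every underlying solution $X_{i}$ is $1$ -optimal.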

Corollary 4

The VIA bertsimas2015 and KKT keshavarz2011 estimators are statistically inconsistent.

Proof

By definition an estimator is consistent for a class of models if and only if it is consistent for each model in that class. Thus to show inconsistency of VIA and KKT it suffices to show inconsistency for a single model. The above proposition establishes inconsistency of VIA and KKT for a particular model because $\epsilon_{0}=1$ while $\operatorname*{as-lim\,inf}_{n}\hat{\epsilon}>r/3$ , meaning these approaches are inconsistent when $r>3$ . ∎

5.3 Approximate Bilevel Programming (ABP) Estimator

To correct the indistinguishability (between suboptimality of solutions and noise in measurements) problem faced by existing approaches, we instead propose an estimator that explicitly models the measured data as consisting of a suboptimal solution added to noise. More specifically, we propose the following statistical estimator

[TABLE]

where $\lambda\in\mathbb{R}_{+}$ and $d^{2}$ is the squared distance function defined in the preliminaries. It is also useful to consider estimators defined as approximate solutions to the above optimization problem. Let $z\in\mathbb{R}_{+}$ be a nonnegative value, and define the estimates

[TABLE]

For notational convenience, we will call this estimator the ABP estimator. Note these estimates are defined as being any $z\operatorname*{-arg}\min$ of the optimization problem (30).
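To make the structure of the ABP objective concrete, here is a minimal sketch for the scalar model of Proposition 4, where $S(\epsilon)=[-a,a]$ with $a=\min(\sqrt{\epsilon},1)$ and $W=[-r,r]$ , so that $S(\epsilon)\oplus W$ is again an interval. The objective is taken to be the average squared distance plus $\lambda\cdot\epsilon$ , matching the two terms discussed in this section; the function names and the five data points are hypothetical.

```python
import math


def sq_dist_to_interval(y, lo, hi):
    """Squared distance from the point y to the interval [lo, hi]."""
    gap = max(lo - y, 0.0, y - hi)
    return gap * gap


def abp_objective(eps, ys, r, lam):
    """(1/n) * sum_i d^2(y_i, S(eps) + W) + lam * eps for f = x^2 on [-1, 1]."""
    a = min(math.sqrt(eps), 1.0)       # S(eps) = [-a, a]
    lo, hi = -a - r, a + r             # Minkowski sum with W = [-r, r]
    fit = sum(sq_dist_to_interval(y, lo, hi) for y in ys) / len(ys)
    return fit + lam * eps


ys = [-1.4, -0.3, 0.2, 0.9, 1.5]       # hypothetical measurements
lam = 1.0 / len(ys)                    # the choice lambda = 1/n
obj_large = abp_objective(1.0, ys, 0.5, lam)   # all points covered: only lam * eps remains
obj_small = abp_objective(0.25, ys, 0.5, lam)  # two points fall outside [-1, 1]
```

The first term vanishes once $\epsilon$ is large enough for $S(\epsilon)\oplus W$ to cover the data, after which only the penalty $\lambda\cdot\epsilon$ grows. With only five points the smaller $\epsilon$ still attains the lower objective here, which illustrates that the consistency result is asymptotic.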

Theorem 5.1

The ABP estimator is strongly statistically consistent, meaning we have $\operatorname*{as-lim}_{n}(\hat{\epsilon},\hat{\theta})=(\epsilon_{0},\theta_{0})$ whenever $\lambda=1/n$ and $\lim_{n}(n\cdot z)=0$ .

Proof

Our first step is to show $d^{2}(y,S(u,\epsilon,\theta)\oplus W)$ satisfies certain continuity properties. Note $\{x:g(x,u,\theta)\leq 0\}$ is continuous by Example 5.10 of rockafellar2009 , and so $V(u,\theta)$ is continuous by the Berge maximum theorem berge1963 . Noting $d^{2}(y,S(u,\epsilon,\theta)\oplus W)=\min\{\|y-\hat{y}\|^{2}\ |\ \hat{y}=\hat{x}+\hat{\epsilon},\hat{\epsilon}\in W,f(\hat{x},u,\theta)\leq V(u,\theta)+\epsilon,g(\hat{x},u,\theta)\leq 0\}$ , we can apply the Berge maximum theorem berge1963 since this feasible set is osc by Example 5.8 of rockafellar2009 : This implies $d^{2}(y,S(u,\epsilon,\theta)\oplus W)$ is lower semicontinuous in $(\epsilon,\theta)$ , and so $\mathbb{E}(d^{2}(Y_{i},S(U_{i},\epsilon,\theta)\oplus W))$ is lower semicontinuous in $(\epsilon,\theta)$ by Fatou’s lemma.

Next note that the estimate $(\breve{\epsilon},\breve{\theta})$ also minimizes the optimization problem

[TABLE]

But $n\lambda=1$ by assumption, and so the objective of (32) is nondecreasing in $n$ . Hence $\mathbb{P}(\operatorname*{e-lim}_{n}\sum_{i=1}^{n}d^{2}(Y_{i},S(U_{i},\epsilon,\theta)\oplus W)+\epsilon=\sup_{n}\sum_{i=1}^{n}d^{2}(Y_{i},S(U_{i},\epsilon,\theta)\oplus W)+\epsilon)=1$ by Proposition 7.4 of rockafellar2009 , where $\operatorname*{e-lim}$ is the epi-limit rockafellar2009 . We next prove that

[TABLE]

and our approach is to use a well-known covering argument originally due to Wald wald1949note . Let $S_{k}$ be a decreasing sequence (i.e., $S_{k}\supseteq S_{k+1}\supseteq\cdots$ ) of open neighborhoods of $(\epsilon_{0},\theta_{0})$ , with $\lim_{k}S_{k}=\{(\epsilon_{0},\theta_{0})\}$ . Since $\mathbb{E}(d^{2}(Y_{i},S(U_{i},\epsilon,\theta)\oplus W))$ is lower semicontinuous in $(\epsilon,\theta)$ , this means $\min\{\mathbb{E}(d^{2}(Y_{i},S(U_{i},\epsilon,\theta)\oplus W))\ |\ (\epsilon,\theta)\in([0,\epsilon_{0}]\times\Theta)\setminus S_{k}\}>0$ by the identifiability condition. Thus there exists $\nu_{k}>0$ such that $\mathbb{E}(d^{2}(Y_{i},S(U_{i},\epsilon,\theta)\oplus W))>2\nu_{k}$ for $(\epsilon,\theta)\in([0,\epsilon_{0}]\times\Theta)\setminus S_{k}$ . By lower semicontinuity of $d^{2}(y,S(u,\epsilon,\theta)\oplus W)$ and the monotone convergence theorem, there exists an open neighborhood $T_{k}(\epsilon,\theta)$ for each $(\epsilon,\theta)\in([0,\epsilon_{0}]\times\Theta)\setminus S_{k}$ so that we have $\mathbb{E}(\inf\{d^{2}(Y_{i},S(U_{i},\epsilon^{\prime},\theta^{\prime})\oplus W)\ |\ (\epsilon^{\prime},\theta^{\prime})\in T_{k}(\epsilon,\theta)\})>\mathbb{E}(d^{2}(Y_{i},S(U_{i},\epsilon,\theta)\oplus W))-\nu_{k}$ . Since $([0,\epsilon_{0}]\times\Theta)\setminus S_{k}$ is compact, there exists a finite set $F_{k}\subseteq([0,\epsilon_{0}]\times\Theta)\setminus S_{k}$ such that $T_{k}(\epsilon,\theta)$ for $(\epsilon,\theta)\in F_{k}$ forms a finite subcover of $([0,\epsilon_{0}]\times\Theta)\setminus S_{k}$ . Combining the above with the Borel-Cantelli lemma implies $\mathbb{P}(\inf\{\sup_{n}\sum_{i=1}^{n}d^{2}(Y_{i},S(U_{i},\epsilon^{\prime},\theta^{\prime})\oplus W)\ |\ (\epsilon^{\prime},\theta^{\prime})\in T_{k}(\epsilon,\theta)\}>0)=1$ for each $(\epsilon,\theta)\in F_{k}$ , which by the finiteness of $F_{k}$ implies that $\mathbb{P}(\sup_{n}\sum_{i=1}^{n}d^{2}(Y_{i},S(U_{i},\epsilon,\theta)\oplus W)>0\text{ for }(\epsilon,\theta)\in([0,\epsilon_{0}]\times\Theta)\setminus S_{k})=1$ . The desired (33) follows since we choose the sequence $S_{k}$ such that $S_{k}\downarrow\{(\epsilon_{0},\theta_{0})\}$ .

Next consider the optimization problem

[TABLE]

Note $(\epsilon_{0},\theta_{0})$ is feasible for both (32) and (34), and so the minimums of (32) and (34) are both less than or equal to $\epsilon_{0}$ . This means $\epsilon>\epsilon_{0}$ cannot minimize (34). Furthermore, using (33) implies that almost surely the (unique) minimizer of (34) is $(\epsilon_{0},\theta_{0})$ , and almost surely the minimum value of (34) is $\epsilon_{0}$ . But from the argument in the preceding paragraph, (32) epi-converges almost surely to (34) since $E,\Theta$ are fixed. The result now follows from Theorem 7.33 of rockafellar2009 . ∎

The above result concerns almost sure convergence of the ABP estimates $(\hat{\epsilon},\hat{\theta})$ to the actual parameters $(\epsilon_{0},\theta_{0})$ , but a related question is whether the corresponding solution set estimates $S(u,\hat{\epsilon},\hat{\theta})$ converge to the actual solution sets $S(u,\epsilon_{0},\theta_{0})$ . Our semicontinuous mapping theorem can be used to establish almost sure convergence of the solution set estimates, and this argument leads to the following corollary.

Corollary 5

We have that $\operatorname*{as-lim\,sup}_{n}S(u,\hat{\epsilon},\hat{\theta})\subseteq S(u,\epsilon_{0},\theta_{0})$ for $u\in D$ . If $\epsilon_{0}>0$ or $f(\cdot,u,\theta)$ is strictly convex in $x$ , then $\operatorname*{as-lim}_{n}S(u,\hat{\epsilon},\hat{\theta})=S(u,\epsilon_{0},\theta_{0})$ for $u\in D$ .

Proof

The above proof established that $S(u,\epsilon,\theta)$ is osc in $\epsilon,\theta$ . And so the first part of the corollary follows by the semicontinuous mapping theorem. If $\epsilon_{0}>0$ then $S(u,\epsilon_{0},\theta_{0})$ is continuous at $(\epsilon_{0},\theta_{0})$ by Example 5.10 of rockafellar2009 . If $f(\cdot,u,\theta)$ is strictly convex in $x$ , then $S(u,0,\theta_{0})$ is single-valued rockafellar2009 . Hence $S(u,0,\theta_{0})$ is continuous because a single-valued, osc, and locally bounded function is continuous rockafellar2009 . Thus the second part of the corollary follows from the semicontinuous mapping theorem.∎

5.4 Algorithms to Compute ABP Estimator

We next discuss numerical computation of ABP using the data $(u_{i},y_{i})$ for $i=1,\ldots,n$ . The ABP estimator is an approximate (i.e., the solution sets have $\epsilon$ possibly greater than zero) bilevel program; bilevel programs are optimization problems in which some decision variables are constrained to be solutions of another optimization problem, called the lower level problem. One approach to solving bilevel programs replaces the lower level problem with its KKT conditions allende2013 ; dempe2012 , and the result can sometimes be rewritten as a mixed-integer program that may be numerically solved quickly aswani2016_wl . Another approach upper bounds the objective function of the lower level problem by its value function outrata1990 ; ye1995 .

Here we describe how a third approach that upper bounds the objective function of the lower level problem by its dual function aswani2015 ; aswani2016 can be used to compute the ABP estimator. If $h(u,\theta,\lambda)$ is the Lagrangian dual function corresponding to (24), then under mild conditions ensuring zero duality gap the ABP estimator is given by

[TABLE]

This duality-based reformulation can be numerically solved by two different algorithms aswani2015 ; aswani2016 , which we briefly describe here. More details can be found in the corresponding references, and both algorithms assume the sets $E,\Theta$ are compact.

Since the reformulation (35) is a convex optimization problem for fixed $(\epsilon,\theta)$ , one algorithm aswani2015 for computing ABP is to: discretize the set $E\times\Theta$ into a finite set $\Delta=\{(\epsilon_{1},\theta_{1}),\ldots,(\epsilon_{k},\theta_{k})\}$ such that it forms a set covering with balls of a prescribed radius, compute the minimum objective function value of (35) for $(\epsilon,\theta)\in\Delta$ (which we call $Q(\epsilon,\theta)$ ), and then choose estimates $(\hat{\epsilon},\hat{\theta})=\arg\min\{Q(\epsilon,\theta)\ |\ (\epsilon,\theta)\in\Delta\}$ . A result from aswani2015 implies that estimates chosen using this enumeration algorithm satisfy the assumptions of Theorem 5.1, which is sufficient for statistical consistency.
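The enumeration idea can be sketched for the one-dimensional model of the numerical example in Section 5.5 ( $f=-(\theta+u)\cdot x$ , $-2\leq x\leq 2$ , $W=[-1,1]$ , $\epsilon_{0}=1$ , $\theta_{0}=0$ ). This is a simplified illustration and not the referenced code: because the solution sets are intervals here, $Q(\epsilon,\theta)$ is evaluated in closed form instead of by solving the convex program (35), and the grid (truncated to $\epsilon\leq 3$ to keep the run short), sample size, and seed are arbitrary assumptions.

```python
import random


def solution_interval(u, eps, theta):
    """S(u, eps, theta) for f = -(theta + u) * x subject to -2 <= x <= 2."""
    c = theta + u
    if c > 0.0:
        return max(-2.0, 2.0 - eps / c), 2.0
    if c < 0.0:
        return -2.0, min(2.0, -2.0 - eps / c)
    return -2.0, 2.0  # objective is constant, every feasible point is optimal


def q_value(eps, theta, data, lam, r=1.0):
    """ABP objective: mean squared distance to S + W, plus lam * eps."""
    total = 0.0
    for u, y in data:
        lo, hi = solution_interval(u, eps, theta)
        gap = max(lo - r - y, 0.0, y - hi - r)
        total += gap * gap
    return total / len(data) + lam * eps


def simulate(n, eps0=1.0, theta0=0.0, seed=0):
    """Draw (u_i, y_i) with x_i uniform on S(u_i, eps0, theta0) and noise in [-1, 1]."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        u = rng.uniform(-2.0, 2.0)
        lo, hi = solution_interval(u, eps0, theta0)
        data.append((u, rng.uniform(lo, hi) + rng.uniform(-1.0, 1.0)))
    return data


data = simulate(500)
lam = 1.0 / len(data)
# Finite grid standing in for the discretization Delta of E x Theta.
grid = [(0.1 * i, 0.25 * j) for i in range(1, 31) for j in range(-8, 9)]
eps_hat, theta_hat = min(grid, key=lambda p: q_value(p[0], p[1], data, lam))
```

Since the true parameters lie on this grid and fit the data with zero distance, the selected $\hat{\epsilon}$ cannot exceed $\epsilon_{0}$ , mirroring the argument in the proof of Theorem 5.1. A finer grid trades computation for resolution.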

A second algorithm aswani2016 replaces the Lagrangian dual by a numerically computed dual. Partial dualization is used to define a regularized dual function (RDF)

[TABLE]

Here, $X$ is any compact set defined such that $\{x:\exists(u,\theta)\in U\times\Theta\text{ s.t. }g(x,u,\theta)\leq 0\}\subseteq\mathrm{int}(X)$ . The intuition is that $X$ is a set that contains all the feasible sets of (24) within its interior. When $g$ does not depend on $(u,\theta)$ , we can choose $X=\{x:l_{i}-1\leq x_{i}\leq u_{i}+1\}$ with $u_{i}=\max\{x_{i}\ |\ g(x)\leq 0\}$ and $l_{i}=\min\{x_{i}\ |\ g(x)\leq 0\}$ that are computed by solving convex optimization problems. Many applications of inverse approximate optimization consist of such a setting where the feasible set is independent of the inputs $u$ or the parameters $\theta$ . The benefit of the RDF is it can be numerically computed because it is a convex optimization problem, and that its gradient

[TABLE]

always exists when $\mu>0$ . In contrast, the Lagrangian dual is usually only directionally differentiable but not differentiable. The algorithm proceeds by using a nonlinear numerical solver to solve a sequence of optimization problems in which $\mu$ goes to $0$ .

A third possibility is a polynomial time approximation algorithm with the property that statistical consistency holds as the number of samples $n$ increases to infinity. Such an algorithm has been constructed, when $f$ is affine in $\theta$ and $g$ does not depend on $\theta$ , for inverse optimization with noisy data aswani2015 ; it uses kernel regression to pre-smooth the data and then solves a convex problem corresponding to inverse optimization assuming no noise in the pre-smoothed data. Here we sketch a similar algorithm for inverse approximate optimization, and we leave its analysis for future work. Define $\hat{S}(u)=\mathrm{co}(\{y_{i}:\|u_{i}-u\|\leq h\})\ominus W$ for $h\in\mathbb{R}_{+}$ , and choose the data $\hat{x}_{i}$ by sampling from the uniform distribution on $\hat{S}(u_{i})$ . The estimate $(\breve{\epsilon},\breve{\theta})$ is computed by solving (35) with the change that the $g(\hat{x}_{i},u_{i},\theta)\leq 0$ constraints are removed.
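The pre-smoothing step has a simple form when $u$ and $y$ are scalars and $W=[-r,r]$ : the convex hull is an interval, and the Minkowski difference shrinks it by $r$ on each side. The sketch below illustrates only this step under those assumptions; the function names, the handling of an empty result, and the data are hypothetical, and the subsequent solve of (35) is not shown.

```python
import random


def presmoothed_set(u, data, h, r):
    """1-D version of co({y_i : |u_i - u| <= h}) eroded by W = [-r, r].

    Returns (lo, hi), or None when no data is nearby or the erosion is empty.
    """
    ys = [y for ui, y in data if abs(ui - u) <= h]
    if not ys:
        return None
    lo, hi = min(ys) + r, max(ys) - r   # Minkowski difference of intervals
    return (lo, hi) if lo <= hi else None


def sample_presmoothed(data, h, r, seed=0):
    """Draw one point uniformly from the pre-smoothed set at each u_i."""
    rng = random.Random(seed)
    samples = []
    for u, _ in data:
        s = presmoothed_set(u, data, h, r)
        samples.append(None if s is None else rng.uniform(s[0], s[1]))
    return samples


# Hypothetical measurements (u_i, y_i).
data = [(0.0, -1.5), (0.05, 0.2), (0.1, 1.4), (1.0, 0.3)]
local_set = presmoothed_set(0.05, data, h=0.1, r=0.5)
isolated = presmoothed_set(1.0, data, h=0.1, r=0.5)
x_hat = sample_presmoothed(data, h=0.1, r=0.5)
```

An isolated point such as $u=1$ yields an empty set after erosion, so a practical implementation would need a rule for such points (for example a larger bandwidth); that choice is left open here.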

5.5 Numerical Example

We next consider a numerical example to visually compare estimates of $S(u,\epsilon_{0},\theta_{0})$ produced by our ABP estimator and the VIA bertsimas2015 and KKT keshavarz2011 estimators. Suppose $x\in\mathbb{R}$ , $f=-(\theta+u)\cdot x$ , $g=[x-2;-x-2]$ , $\epsilon_{0}=1$ , $\theta_{0}=0$ , $W=\{w:\|w\|\leq 1\}$ , $U_{i}$ has a uniform distribution $U(-2,2)$ , $X_{i}$ is uniformly distributed on $S(U_{i},\epsilon_{0},\theta_{0})$ , $E=\{\epsilon:0.1\leq\epsilon\leq 10\}$ , and $\Theta=\{\theta:-2\leq\theta\leq 2\}$ . The solution set $S(u,\epsilon_{0},\theta_{0})$ in this setting is shown in the left column of Fig. 2. Each measurement $(u_{i},y_{i})$ for this example is a point, and the top row of Fig. 2 shows the measurements for $n=10^{1}$ , $n=10^{2}$ , and $n=10^{3}$ data points, respectively. The rows below show the estimated (using the measurements shown above) solution set as computed by ABP, KKT, and VIA, respectively. (Our code, available at http://ieor.berkeley.edu/~aaswani/code/ssvf.zip , runs in about three hours.) This example shows that as the number of measurements increases, the solution set estimated by ABP (KKT and VIA) converges (does not converge) to the actual solution set. This statistical behavior is expected given our theoretical results on the strong consistency of ABP and the statistical inconsistency of KKT and VIA.

5.6 Related Inverse Optimization Problems

In our problem setup, the measurement noise $W_{i}$ had a distribution with a finite support. However, noise models commonly used in statistics include distributions with unbounded support but finite variance. The canonical example is $W_{i}$ that are jointly Gaussian with zero mean and finite covariance. A heuristic approach for distributions with unbounded support is to use our ABP estimator with the choices of $W=(2\log n)^{1/2}\cdot\Sigma$ for sub-Gaussian distributions (i.e., distributions bounded from above by a jointly Gaussian random variable) and $W=((2\log n)^{1/2}+\log n)\cdot\Sigma$ for sub-exponential distributions (i.e., distributions with exponentially decaying tails), where $\Sigma=\mathbb{E}(W_{i}^{\vphantom{\mathsf{T}}}W_{i}^{\mathsf{T}})$ is the covariance matrix of $W_{i}$ . The reason for this suggested heuristic is these choices of $W$ are analogous to bounds on the maximum expected values of sub-Gaussian and sub-exponential random variables boucheron2013 .
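For a sense of scale, the multipliers in this heuristic grow slowly with the sample size. The helper below only evaluates the two scalar factors $(2\log n)^{1/2}$ and $(2\log n)^{1/2}+\log n$ ; the function name and the `tail` argument are hypothetical, and how the factor and $\Sigma$ are turned into a concrete set $W$ is not specified by this sketch.

```python
import math


def noise_scale(n, tail="sub-gaussian"):
    """Scalar multiplier applied to Sigma in the heuristic choice of W."""
    if n < 2:
        raise ValueError("need at least two samples")
    base = math.sqrt(2.0 * math.log(n))
    if tail == "sub-gaussian":
        return base
    if tail == "sub-exponential":
        return base + math.log(n)
    raise ValueError("tail must be 'sub-gaussian' or 'sub-exponential'")


scale_gauss = noise_scale(1000)
scale_exp = noise_scale(1000, tail="sub-exponential")
```

For $n=1000$ the sub-Gaussian factor is roughly $3.7$ and the sub-exponential factor roughly $10.6$ , so the assumed noise set widens only logarithmically as more data arrives.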

Since the ABP estimator is a heuristic in this setting, an obvious topic is to design a statistically consistent estimator for inverse approximate optimization problems with unbounded noise. Maximum likelihood estimation is arguably the most natural approach because otherwise it is difficult to distinguish between noise and suboptimality of solutions. Specifically, consider the original problem setup but with the changes that the random sample is $(u_{i},X_{i})$ , the $X_{i}$ are uniformly distributed within $S(u_{i},\epsilon_{0},\theta_{0})$ , and that $W_{i}$ is distributed according to some known density $f_{W}(u)$ . Then the maximum likelihood estimator (MLE) for this modified problem setup is given by

[TABLE]

This optimization problem has a challenging structure in which the domains of integration depend upon the decision variables royset2017variational , and presents an opportunity for the further study of designing numerical algorithms to solve such optimization problems. We do note that for fixed $(\epsilon,\theta)$ , the integrals in the objective can be numerically computed in polynomial time using hit-and-run techniques for sampling from convex sets lovasz2006 ; smith1984 . And so the enumeration algorithm we described earlier for the ABP estimator could be easily modified to solve this MLE problem.

Remark 6

The ABP and MLE estimators are actually qualitatively the same. The $\frac{1}{n}\sum_{i=1}^{n}d^{2}(Y_{i},S(U_{i},\epsilon,\theta)\oplus W)$ term in ABP and the $-\frac{1}{n}\sum_{i=1}^{n}\log\int_{x\in S(u_{i},\epsilon,\theta)}f_{W}(Y_{i}-x)dx$ term in MLE both penalize estimates in which the solutions $Y_{i}$ are far from the solution sets $S(\cdot,\epsilon,\theta)$ , and the $\frac{1}{n}\sum_{i=1}^{n}\log\int_{x\in S(u_{i},\epsilon,\theta)}dx$ term in MLE and the $\lambda\cdot\epsilon$ term in ABP both penalize estimates that generate large solution sets.
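The two MLE terms named in this remark can be evaluated in closed form in a simple special case, which the sketch below uses for illustration. It assumes scalar data, solution sets that are intervals $[a_{i},b_{i}]$ already computed for a fixed $(\epsilon,\theta)$ , and Gaussian noise with known standard deviation `sigma`; these assumptions and the function names are not from the paper, where the integrals would in general be computed by sampling.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def neg_log_likelihood(ys, intervals, sigma):
    """Average of -log(int_S f_W(y - x) dx) + log(int_S dx) over the sample.

    For S = [a, b] and Gaussian f_W, the first integral equals
    Phi((y - a) / sigma) - Phi((y - b) / sigma), and the second equals b - a.
    A measurement extremely far from S can make the first integral underflow
    to zero, in which case math.log raises ValueError.
    """
    total = 0.0
    for y, (a, b) in zip(ys, intervals):
        mass = normal_cdf((y - a) / sigma) - normal_cdf((y - b) / sigma)
        total += -math.log(mass) + math.log(b - a)
    return total / len(ys)


# One hypothetical measurement at the center of the solution set [-1, 1].
nll_center = neg_log_likelihood([0.0], [(-1.0, 1.0)], sigma=1.0)
# The same measurement scored against a much larger solution set.
nll_wide = neg_log_likelihood([0.0], [(-10.0, 10.0)], sigma=1.0)
```

The wider set fits the measurement at least as well but pays a larger volume penalty, which is the role played by $\lambda\cdot\epsilon$ in ABP.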

In the two inverse approximate optimization problem setups considered above, we assumed the approximate solutions $X_{i}$ were drawn from the solution sets $S(U_{i},\epsilon_{0},\theta_{0})$ according to some distribution. However, another modified problem setup would be to assume the $X_{i}$ were chosen from the solution sets by solution of another optimization problem. This kind of setup corresponds to a scenario in which the $X_{i}$ are solutions to an optimistic bilevel optimization problem with unique solutions:

[TABLE]

In this case, the estimation procedure can be posed as a least squares problem

[TABLE]

This is a challenging multi-level optimization problem and presents an opportunity for the further study of designing numerical algorithms to solve such optimization problems. We do note that for fixed $(\epsilon,\theta)$ , this becomes a convex optimization problem. And so the enumeration algorithm we described earlier for the ABP estimator could be easily modified to solve this least squares problem.

6 Conclusion

In this paper, we used variational analysis to develop tools for statistics with set-valued functions, and then applied these tools to two estimation problems. We constructed and studied a kernel regression estimator for set-valued functions and an estimator for the inverse approximate optimization problem. The area of statistics with set-valued functions remains largely unexplored with many remaining problems. One question is the design of numerical representations of sets and set-valued functions. Though constraint representations of sets are pervasive, numerical machinery like epi-splines royset2016 may offer greater representational flexibility. Another question is the development of numerical algorithms to solve optimization problems that arise in statistical estimation for set-valued functions. Related inverse optimization problems lead to formulations (38) and (40) with structures that are not well-studied from the perspective of numerical optimization. Further study of statistics with set-valued functions will require developing new numerical methods and optimization theory.

Bibliography

1. Ahuja, R., Orlin, J.: Inverse optimization. Operations Research 49 (5), 771–783 (2001)
2. Allende, G., Still, G.: Solving bilevel programs with the KKT-approach. Mathematical Programming 138 (1-2), 309 (2013)
3. Artstein, Z., Vitale, R.: A strong law of large numbers for random compact sets. The Annals of Probability pp. 879–882 (1975)
4. Aswani, A., Bickel, P., Tomlin, C.: Regression on manifolds: Estimation of the exterior derivative. The Annals of Statistics pp. 48–81 (2011)
5. Aswani, A., Gonzalez, H., Sastry, S., Tomlin, C.: Provably safe and robust learning-based model predictive control. Automatica 49 (5), 1216–1226 (2013)
6. Aswani, A., Kaminsky, P., Mintz, Y., Flowers, E., Fukuoka, Y.: Behavioral modeling in weight loss interventions. Available at SSRN: https://ssrn.com/abstract=2838443 (2016)
7. Aswani, A., Shen, Z.J., Siddiq, A.: Inverse optimization with noisy data. Operations Research (2017). Accepted
8. Aumann, R.J.: Integrals of set-valued functions. Journal of Mathematical Analysis and Applications 12 (1), 1–12 (1965)