Statistics with Set-Valued Functions: Applications to Inverse Approximate Optimization
Anil Aswani

TL;DR
This paper develops a statistical framework for set-valued functions using variational analysis, enabling consistent estimation in inverse approximate optimization with noisy data.
Contribution
It introduces operational tools for statistics with set-valued functions and applies them to inverse approximate optimization, ensuring statistical consistency under noise.
Findings
Previous methods are statistically inconsistent with noisy data.
The proposed approach achieves consistency under mild conditions.
Applications include nonparametric estimation of set-valued functions.
A. Aswani
Industrial Engineering and Operations Research, University of California, Berkeley, CA, USA
email: [email protected]
This work was supported in part by NSF Award CMMI-1450963.
Abstract
Much of statistics relies upon four key elements: a law of large numbers, a calculus to operationalize stochastic convergence, a central limit theorem, and a framework for constructing local approximations. These elements are well-understood for objects in a vector space (e.g., points or functions); however, much statistical theory does not directly translate to sets because they do not form a vector space. Building on probability theory for random sets, this paper uses variational analysis to develop operational tools for statistics with set-valued functions. These tools are first applied to nonparametric estimation (kernel regression of set-valued functions). The second application is to the problem of inverse approximate optimization, in which approximate solutions (corrupted by noise) to an optimization problem are observed and then used to estimate the amount of suboptimality of the solutions and the parameters of the optimization problem that generated the solutions. We show that previous approaches to this problem are statistically inconsistent when the data is corrupted by noise, whereas our approach is consistent under mild conditions.
Keywords: set-valued functions, statistics, inverse optimization
Journal: Mathematical Programming
1 Introduction
While statistical theory is well-developed for problems concerning (single-valued) functions bickel2006 ; van2000 , there has been less work on statistics with sets or set-valued functions. Most attention in statistics on sets has been focused on the problem of estimating a single set under different measurement models devroye1980 ; geffroy1964 ; guntuboyina2012 ; korostelev1995 ; patschkowski2016 ; renyi1963 ; scholkopf2001 . The problem of estimating set-valued functions is less well studied, though it has potential applications in varied domains including healthcare, robotics, and energy. For instance, we study in this paper the problem of inverse approximate optimization, where approximate solutions (corrupted by noise) to a parametric optimization problem are observed and then used to estimate the amount of suboptimality of the solutions and the parameters that generated the solutions. Inverse approximate optimization can be used to construct predictive models of human behavior and decision-making, where the explicit model is that an individual makes decisions by approximately solving an optimization problem. Statistical estimation in this context could be used to quantify the tradeoffs made by a particular individual between competing objectives, as well as quantify the predictability of the decision-making process. This particular problem of inverse approximate optimization is related to the broader topic of statistics with set-valued functions because the solution mapping of an (even strictly convex) optimization problem becomes a set when suboptimality of solutions is allowed. Thus a framework for statistics with set-valued functions is needed to study such problems.
A substantial impediment to studying such estimation problems is the lack of statistical tools for random sets and set-valued functions, and two technical issues prevent the use of existing tools. The first is that most statistical theory assumes objects belong to a vector space, which is the case for points and functions. But sets do not form a vector space, and so existing statistical theory cannot be used. This is a fundamental difficulty, and even the usual notion of expectation does not apply to sets molchanov2006 . The second is that most statistical theory has been developed by using metrics and distance functions to derive results. But analyzing sets using distances is difficult, and most analysis tools and results for sets do not use this approach berge1963 ; rockafellar2009 .
Arguably the most natural approach to statistics with random sets is to define a family of sets parametrized by a random vector, and then perform standard statistical analysis with respect to this parametrization. However, it is not clear without further analysis whether stochastic convergence of the estimated parameters implies stochastic convergence of the corresponding set estimates. We study this question in a more general framework and give a counterexample demonstrating that parameter convergence does not always imply set convergence. Moreover, the parametrization approach does not lead to a useful definition for the expectation of random sets molchanov2006 ; the reason is that the expectation of the parameters does not characterize the expectation of the set in a way that ensures the law of large numbers holds.
One goal of this paper is to establish tools for statistics with set-valued functions, and this requires understanding four main ingredients: a law of large numbers, a calculus to operationalize stochastic convergence, a central limit theorem, and tools for constructing local approximations. Probability theory for random sets molchanov2006 provides an expectation for random sets aumann1965 ; kudo1953 , a law of large numbers artstein1975 , and a central limit theorem weil1982 . Here we use variational analysis rockafellar2009 to advocate a notion of local approximation for set-valued functions, and to develop results that allow us to interpret stochastic convergence and expectations of random sets as operators.
The paper begins by describing our notation and providing some useful definitions related to set-valued functions. We focus in this paper on almost sure (a.s.) convergence because the corresponding definitions and approach most clearly demonstrate the tight link between variational analysis and statistics. Defining set convergence in probability requires metrization, which partially obscures the relationship to variational analysis. We also focus on Lipschitz continuity for set-valued functions because we advocate using this concept as a notion of local approximation for set-valued functions. The utility of this approach is displayed later in the paper when we use Lipschitz continuity as a replacement for differentiability when proving a Delta method-like result and proving statistical consistency of a kernel regression estimator.
The next section shows how to interpret stochastic convergence and expectation of random sets as operators. We study the limit of sequences of sets under different set operations, after proving a set-based generalization of the continuous mapping theorem bickel2006 from statistics. Then we study the expectation of random sets under various set operations. Standard proofs about the properties of the expectation of random variables do not extend because the expectation of a random set cannot be computed by integration. This means that properties such as Jensen’s inequality, or the factoring of the expectation of the product of a random matrix with an independent random set, have not been previously established, and we prove such results. We conclude by reviewing a law of large numbers and a central limit theorem for random sets.
Another goal of this paper is to study two problems of estimating set-valued functions, and through the process of analyzing these problems we demonstrate the utility of our tools for statistics with set-valued functions. The first problem we study is estimating a set-valued function using noisy measurements of the set. We propose a kernel regression estimator that can be interpreted as a generalization of methods for functions aswani2011 ; aswani2013 ; bickel1982 ; noda1976 ; wand1994 . The key step in proving statistical consistency is using Lipschitz continuity of the set-valued function to construct local approximations. We show that statistical consistency follows by combining our results on stochastic convergence with convergence bounds on (vector-valued) random variables.
The second problem we study is inverse approximate optimization, where noisy measurements of approximate solutions to an optimization problem are used to estimate the suboptimality of the solutions and the parameters of the optimization problem. In contrast, past work on inverse optimization assumes no noise ahuja2001 ; chan2014 ; esfahani2015 or exact solutions aswani2015 ; bertsimas2015 ; keshavarz2011 . We develop a method for inverse approximate optimization and prove its statistical consistency using stochastic epi-convergence aswani2011 ; dupacova1988 ; geyer1994 ; knight2000 ; lachout2005 ; salinetti1986 . Combining with our results on stochastic convergence and results on the continuity of solutions to optimization problems rockafellar2009 ; royset2016b shows our method consistently estimates the (set-valued) approximate solution mapping that generates the data.
We conclude by examining extensions of the problem of inverse approximate optimization, as well as discussing related open questions about statistics with set-valued functions. In particular, we describe how some extensions lead to formulations of optimization problems with structures (e.g., objective functions that are integrals whose domain of integration depends on the decision variable) that have not been well-studied from the perspective of numerical optimization. Performing statistics with sets and set-valued functions also leads to questions about the design of numerical representations of sets. We argue that further study of statistics with set-valued functions will require developing new numerical methods and optimization theory.
2 Preliminaries
This section presents the notation used in this paper, as well as several useful concepts from variational analysis. Most of the variational analysis definitions are from rockafellar2009 . The definition of set-valued set functions is from matheron1975 , and we use the definitions of the Minkowski set operations from schneider1993 . We abbreviate almost surely using a.s.
2.1 Notation
Let be the space of closed subsets of , and let be the space of compact subsets of . We will focus on cases where is a Euclidean space, and so will use the notation to refer to the corresponding spaces. Clearly by definition.
Suppose are sets and is a matrix or scalar. We use the set notation: is the union of ; is the intersection of ; denotes that is a subset of ; denotes that is a superset of ; is the closure of ; is the convex hull of ; is the complement of ; is the boundary of ; is the Minkowski sum of ; is the Minkowski difference of ; ; and .
2.2 Limit Definitions and Set-Valued Mappings
The outer limit of the sequence of sets is defined as
[TABLE]
and the inner limit of the sequence of sets is defined as
[TABLE]
The outer limit consists of all the cluster points of , whereas the inner limit consists of all limit points of . The limit of the sequence of sets exists if the outer and inner limits are equal, and we define that .
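As a concrete illustration of the difference between cluster points and limit points, consider two simple sequences of subsets of the real line. This is a minimal worked example; the symbols $C_n$, $D_n$, $\limsup$, and $\liminf$ for the sequences and their outer and inner limits are notation assumed here for illustration.

```latex
% Alternating singletons: the outer and inner limits differ.
C_n = \begin{cases} \{0\}, & n \text{ even} \\ \{1\}, & n \text{ odd} \end{cases}
\qquad\Longrightarrow\qquad
\limsup_{n} C_n = \{0, 1\}, \quad \liminf_{n} C_n = \emptyset .

% Shrinking singletons: the outer and inner limits agree.
D_n = \{1/n\}
\qquad\Longrightarrow\qquad
\limsup_{n} D_n = \liminf_{n} D_n = \lim_{n} D_n = \{0\} .
```

Both 0 and 1 are cluster points of selections from $C_n$, but no sequence of points $x_n \in C_n$ converges, so the limit of $C_n$ does not exist.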
Let denote the extended real line. A sequence of extended-real-valued functions is said to epi-converge to if at each we have
[TABLE]
The notion of epi-convergence is so-named because it is equivalent to set convergence of the epigraphs of , meaning that epi-convergence is equivalent to the condition .
A set-valued set function assigns to each set a set . The outer limit of at the set is defined as
[TABLE]
and the inner limit of at the set is defined as
[TABLE]
The intuition is similar to the notions for sequences of sets. The set-valued set function is outer semicontinuous (osc) at if , and is inner semicontinuous (isc) at if . The set-valued set function is continuous at when it is both osc and isc, that is when .
Variational analysis typically uses set-valued functions, rather than set-valued set functions. A set-valued function $F : X \rightrightarrows U$ assigns to each point $x \in X$ a set $F(x) \subseteq U$. Outer limits, inner limits, outer semicontinuity, inner semicontinuity, and continuity are defined as above but with points replacing sets in the domain. Moreover, a set-valued function applied pointwise to sets is an osc, isc, continuous set-valued set function whenever the set-valued function is osc, isc, continuous, respectively.
2.3 Probability Definitions and Stochastic Convergence
Let be a complete probability space, where is the sample space, is the set of events, and is the probability measure. A map is a random set if for each in the Borel -algebra on molchanov2006 . Like the usual convention for random variables, we notationally drop the argument for a random set.
When discussing samples for estimation, we use the convention that capital letters denote random variables, and lowercase letters denote measured data. Also, we use the notation to specify a uniform distribution with support .
We next define almost sure stochastic convergence of random sets. The notation denotes , the notation denotes , and the notation denotes . Note and if and only if , since a countable intersection of almost sure events occurs almost surely.
2.4 Distances and Lipschitz Continuity
Let $d(x,C)$ and $d^{2}(x,C)$ be the distance and squared distance, respectively, from a point $x$ to set $C$. The support function of is . We also define the indicator function to equal [math] when and when . The (integrated) set distance between $C$ and $D$ is defined as $d\!l(C,D)=\int_{0}^{\infty}d\!l_{r}(C,D)e^{-r}\,dr$, where the pseudo-distance between sets $C$ and $D$ is given by $d\!l_{r}(C,D)=\max_{\|x\|\leq r}\big|d(x,C)-d(x,D)\big|$. Note $d\!l_{r}(\{x\},C)\neq d(x,C)$ for all . The integrated set distance is a metric that characterizes the convergence defined earlier for sets in , and the Pompeiu-Hausdorff distance $d\!l_{\infty}$ is a metric that characterizes the convergence defined earlier for sets in . Since these metrics are complex, the sequence characterization of convergence is arguably more natural for sets.
One exception to this statement is in defining Lipschitz continuity for set-valued functions. A set-valued function $F : X \rightrightarrows U$ is Lipschitz continuous on with constant if it is nonempty, closed-valued and such that
[TABLE]
where is the unit ball. A set-valued set function is Lipschitz continuous on with constant if it is nonempty, closed-valued and
[TABLE]
We will make use of Lipschitz continuity as a zeroth-order local approximation.
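The distances in this subsection can be explored numerically for finite point sets in the plane. The following is a minimal sketch under stated assumptions: the helper names `dist`, `hausdorff`, `pseudo_dist`, and the two-point set-valued function `F` are hypothetical and introduced only for illustration, the maximum over the ball of radius r is approximated on a grid, and Lipschitz continuity is checked through the Pompeiu-Hausdorff distance, which is one common way to express it for compact values.

```python
import math

def dist(x, C):
    """Distance d(x, C) from the point x to a finite, nonempty set C in R^2."""
    return min(math.dist(x, c) for c in C)

def hausdorff(C, D):
    """Pompeiu-Hausdorff distance between finite, nonempty sets C and D."""
    return max(max(dist(c, D) for c in C), max(dist(d, C) for d in D))

def pseudo_dist(C, D, r, steps=40):
    """Grid approximation of the max over ||x|| <= r of |d(x, C) - d(x, D)|."""
    best = 0.0
    for i in range(-steps, steps + 1):
        for j in range(-steps, steps + 1):
            x = (r * i / steps, r * j / steps)
            if math.hypot(x[0], x[1]) <= r:
                best = max(best, abs(dist(x, C) - dist(x, D)))
    return best

def F(t):
    """Hypothetical set-valued function with two-point values."""
    return [(t, 0.0), (2.0 * t, 1.0)]

C, D = [(0.0, 0.0)], [(1.0, 0.0)]
print(hausdorff(C, D))            # distance between two singletons
print(pseudo_dist(C, D, r=2.0))   # never exceeds the Hausdorff distance
print(hausdorff(F(0.0), F(1.0)))  # compare with kappa * |1 - 0| for kappa = 2
```

Matching the two points of `F(s)` with the corresponding points of `F(t)` shows that the Hausdorff distance between the values is at most 2|s - t|, so `F` is Lipschitz in this sense with constant 2.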
3 Mathematical Tools for Statistics with Set-Valued Functions
This section develops mathematical tools that allow us to interpret stochastic convergence and the expectation of random sets as operators. We prove results on the limit of sequences of sets under different set operations, define an expectation for random sets, and then derive results about the behavior of this expectation under different set operations. We conclude this section by briefly summarizing a law of large numbers and a central limit theorem for random sets.
3.1 Stochastic Limit Theorems
Our reason for considering set-valued set functions is that this allows us to generalize more precisely the classical continuous mapping theorem of statistics bickel2006 to mappings applied to sequences of sets. Because semicontinuity is an important aspect of set convergence, a generalization that considers semicontinuity leads to a richer set of results than simply considering continuity.
Theorem 3.1 (Semicontinuous Mapping Theorem)
Let be a set-valued set function, and suppose . There are three cases:
1. If is osc at , then .
2. If is isc at , then .
3. If is continuous at , then .
Proof
The definition of osc (isc) means implies (). This means (), which shows the first two cases. The third case follows from the first two cases by recalling that continuity at is equivalent to being both osc and isc at .∎
Remark 1
One consequence is that the set-valued function parametrized by has the behavior that implies only when the set is continuous with respect to the parametrization. For example, consider if , if , and if . If , then and so . But and .
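The phenomenon described in this remark can be reproduced with a very small parametrized family. The sketch below uses a hypothetical family chosen for illustration (it is not claimed to be the example in the remark): the set `C(theta)` consists of all x in [-1, 1] with theta * x >= 0, stored as an interval `(lo, hi)`.

```python
def C(theta):
    """Hypothetical parametrized set {x in [-1, 1] : theta * x >= 0}, as (lo, hi)."""
    if theta > 0:
        return (0.0, 1.0)
    if theta < 0:
        return (-1.0, 0.0)
    return (-1.0, 1.0)  # theta == 0 imposes no constraint

thetas = [1.0 / n for n in range(1, 1001)]   # theta_n -> 0
sets_along_sequence = {C(t) for t in thetas}
print(sets_along_sequence)  # a single interval: the sequence of sets is constant
print(C(0.0))               # the set at the limiting parameter
```

The parameters converge to 0, yet the sets stay equal to [0, 1], which is strictly smaller than the set [-1, 1] at the limiting parameter. Only the outer-limit inclusion survives here, which matches the osc case of the semicontinuous mapping theorem.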
As is customary in statistics, we immediately get some useful corollaries to our semicontinuous mapping theorem by applying the theorem to specific mappings. Our first corollary applies the semicontinuous mapping theorem to set operations like unions and intersections of sets, the boundary of sets, the convex hull of sets, etc.
Corollary 1
Let be almost surely convergent sequences of sets (i.e., and ). Then we have:
1. [math]
2. [math]
3. [math]
4. [math]
5. [math]
6. [math], when there is a deterministic so a.s.
Proof
We interpret as set-valued set functions with a domain over the product space : The function is continuous matheron1975 , and the function is osc matheron1975 . The set complement and boundary operators can be interpreted as set-valued set functions with domain : The function is isc matheron1975 , and the function is isc matheron1975 . The convex hull operation can be cast as set-valued set functions: is isc when the domain is , and is continuous when the domain is matheron1975 . The results now follow from the corresponding parts of the semicontinuous mapping theorem. ∎
Remark 2
Note the above result states that the stochastic limit of the convex hull operator is sensitive to the domain of the sequence of sets.
We can also apply the semicontinuous mapping theorem to the Minkowski set operations. These results are useful for proving convergence of statistical estimators.
Corollary 2
Let be almost surely convergent sequences of sets (i.e., and ), and let be an almost surely convergent (in the Frobenius norm) sequence of matrices or scalars (i.e., ). If there exists a deterministic so a.s., then
1. [math]
2. [math], when
3. [math]
4. [math], when is invertible
Proof
We interpret as set-valued set functions with a domain over the product space : The function is continuous matheron1975 , and the function is osc if matheron1975 . So the first two results follow from Theorem 3.1. The multiplication operation can be interpreted as a set-valued set function with domain over the product space , where is the space of matrices of appropriate dimension or the space of scalars. We show it is continuous. Suppose is not osc at ; then there exist , , and with , , , and . But by the definition of matrix-set (or scalar-set) multiplication there exists with , and by the boundedness by assumption of there exist and such that with , which is a contradiction since matrix-vector (or scalar-vector) multiplication is osc. Thus is osc. Next, we show is isc at : Consider any and , and let be any sequences satisfying and . By the inner limit definition there exists with , and so with . So satisfies the definition of being isc at , and is continuous since it is also osc. The third result follows from Theorem 3.1. The fourth result is proved by noting Theorem 3.1 implies since the matrix inverse operation is continuous except at points of singularity, and so by the third result. ∎
Our final results on stochastic limits are not based on the semicontinuous mapping theorem, but are nevertheless useful for writing stochastic convergence proofs.
Lemma 1 (Sandwich Lemma)
Let and be almost surely convergent sequences of sets (i.e., and ), and let be a sequence of sets. Then we have
1. [math], when a.s.
2. [math], when a.s.
3. [math], when a.s. and
Proof
For the first two results, note and . The third result follows from the first two results and the definition of limit. ∎
This sandwich lemma is valuable for statistical analysis, and we next present a convergence result that is helpful in proving statistical consistency.
Corollary 3
Let be sequences of sets, with for a sequence . If and exists, then .
Proof
Consider any , and note that by the outer limit definition there exist and such that . Thus for any since by assumption . This means , where the equality holds since exists. Next, choose any . By the inner limit definition there exists such that , and so by the Minkowski sum definition there exist and such that or equivalently that . Since by assumption , this means . Thus . The result follows by noting always holds, and combining with the above.∎
3.2 Expectation
Because sets do not form a vector space, defining expectations for random sets is not straightforward. In fact, a number of different definitions have been proposed molchanov2006 that capture different features that might be desired for an expectation operation. One particularly useful definition is the selection expectation kudo1953 ; aumann1965 . This definition for the expectation of random sets is the most well studied because it leads to a corresponding law of large numbers and central limit theorem molchanov2006 .
For a random set , a selection is a (single-valued) random vector that almost surely belongs to . We say the selection is integrable if is finite, where is the usual -norm. The selection expectation of a random set is defined as
[TABLE]
where is the set of all integrable selections of . The random set is called integrable if , and note this property implies is almost surely non-empty.
The selection expectation is difficult to use because it cannot be computed by taking an integral, as is the case for expectations for objects in a vector space. But since we assume is Euclidean space, the definition of the selection expectation simplifies and has a sharp characterization molchanov2006 : If the probability space is nonatomic and is a bounded and closed integrable random set, then is a compact set, is convex, , and for all , where is the support function. This support function characterization is powerful, and allows us to prove several properties about the selection expectation. More importantly, the following results allow us to operationalize the selection expectation, which is useful from a practical standpoint for performing statistical analysis.
Proposition 1
Suppose are bounded and closed integrable random sets, and let be a random matrix or a random scalar. If the probability space is nonatomic, then
1. [math], when is deterministic
2. [math]
3. [math], when is independent of
4. [math], when a.s.
5. [math]
6. [math]
7. [math], when is a.s. non-empty.
Proof
The first result holds since and . The next result follows from schneider1993 , since . The fourth result holds since when schneider1993 , which implies . For the fifth result, note and . The fourth result gives and , which implies . The sixth result follows since combining , , and the fourth result gives: and , which implies . To prove the seventh result, note schneider1993 . Applying the second and fourth results yields , and so schneider1993 .
The third result cannot be proved using support functions since cannot be written in terms of . (If , then while .) Our approach is to show , since this implies . The inclusion is obvious by definition. To prove the reverse inclusion, let with be the Castaing representation castaing1967multi ; molchanov2006 ; rockafellar2009 of . Then is the Castaing representation of . But by Lemma 1.3 of molchanov2006 , each selection in can be approximated arbitrarily well by step functions with arguments from . Thus , and so since both inclusions were shown. ∎
Remark 3
Note the assumptions for the third result include the cases where: is deterministic, is deterministic, or has positive or negative entries.
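The support function characterization used in the proof above can be checked by hand on a one-dimensional example. The sketch below assumes the characterization takes the standard form $h(\mathbb{E}(X), u) = \mathbb{E}(h(X, u))$, writes $h(C,u) = \sup_{x \in C} u\,x$ for the support function of a set on the real line, and takes $\xi$ uniformly distributed on $[0,1]$; the symbols $h$ and $\xi$ are chosen here for illustration.

```latex
X = [0, \xi], \qquad \xi \sim \mathrm{Uniform}[0,1]
\\
h(X, u) = \sup_{x \in [0,\xi]} u\,x = \max\{0,\, u\,\xi\}
\\
\mathbb{E}\big(h(X, u)\big) = \max\{0,\, u\}\cdot\mathbb{E}(\xi) = \max\{0,\, u/2\} = h\big([0, \tfrac{1}{2}],\, u\big)
\\
\Longrightarrow\quad \mathbb{E}(X) = [0, \tfrac{1}{2}]
```

The last step uses the fact that a compact convex set is determined by its support function.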
Another result used in statistics is Jensen’s inequality bickel2006 , which bounds changing the order of applying an expectation and a convex function to a random variable. Our next result shows we can generalize Jensen’s inequality to set-valued functions.
Proposition 2 (Jensen’s Inequality)
Let be a graph-convex set-valued function (i.e., ), and let be a bounded and closed integrable random set. If is locally bounded (i.e., is bounded for every bounded set ) and continuous, then we have .
Proof
The selection expectation equals the Debreu expectation under our assumptions molchanov2006 . This means there exists a sequence of random sets with the distribution
[TABLE]
such that , , and . Using the semicontinuous mapping theorem implies , and so we have equality of the selection expectation and Debreu expectation molchanov2006 . This means that and . Next note by the graph-convexity of . Taking the limit of this set relationship gives , where we have used the fact that by definition of the continuity of the set-valued function . ∎
Remark 4
Jensen’s inequality is sometimes stated for concave functions, and such a generalization exists for set-valued mappings. If is a graph-concave set-valued function (i.e., ) and the other assumptions of the above theorem hold, then we have .
Lastly, we present a strong law of large numbers (SLLN) for the selection expectation. The key idea is the Minkowski sum takes the role of averaging.
Theorem 3.2 (Artstein and Vitale, 1975 artstein1975 )
Suppose the probability space is non-atomic. If , , are i.i.d. bounded and closed integrable random sets, then we have that: .
This particular strong law of large numbers can be generalized in a number of ways, and a survey of the different generalizations possible can be found in molchanov2006 .
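The role of the Minkowski sum as an averaging operation, and the convexifying effect of the selection expectation, can be seen in a deterministic toy case. The sketch below is an illustration under stated assumptions rather than a general implementation: every random set equals the nonconvex two-point set {0, 1} (so the i.i.d. condition holds trivially), its selection expectation on a nonatomic probability space is taken to be the interval [0, 1], and the helper names are hypothetical.

```python
from fractions import Fraction

def minkowski_sum(A, B):
    """Minkowski sum of two finite subsets of the real line."""
    return {a + b for a in A for b in B}

def minkowski_average(sets):
    """(1/n) times the Minkowski sum of the n given finite sets."""
    total = {Fraction(0)}
    for S in sets:
        total = minkowski_sum(total, S)
    return {s / len(sets) for s in total}

def hausdorff_to_unit_interval(A):
    """Hausdorff distance from a finite A in [0, 1] containing 0 and 1 to [0, 1]."""
    pts = sorted(A)
    return max(b - a for a, b in zip(pts, pts[1:])) / 2

X = {Fraction(0), Fraction(1)}   # nonconvex two-point set
for n in (1, 10, 100):
    avg = minkowski_average([X] * n)
    print(n, hausdorff_to_unit_interval(avg))
```

The Minkowski average of n copies is the grid {0, 1/n, ..., 1}, whose Hausdorff distance to [0, 1] is 1/(2n), so the averages converge to the convex interval even though each summand is nonconvex.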
3.3 Central Limit Theorems
Unlike laws of large numbers that relate convergence of Minkowski sums of i.i.d. random sets to their selection expectation $\mathbb{E}(X)$, analogs of the central limit theorem (CLT) relating Minkowski sums and selection expectations are less well-understood. One major impediment is that the Minkowski difference does not generally invert the Minkowski sum, which means it is generally not possible to normalize (in the sense of having a zero mean) the Minkowski sum $\frac{1}{n}\bigoplus_{i=1}^{n}X_{i}$. As a result, the standard approach to generalizing the central limit theorem is to normalize by instead considering the Hausdorff distance between the Minkowski sum and the selection expectation.
Theorem 3.3 (Weil, 1982 weil1982 )
Suppose the probability space is nonatomic. If , , are i.i.d. bounded and closed integrable random sets, then we have that: $\sqrt{n}\cdot d\!l_{\infty}\big(\frac{1}{n}\bigoplus_{i=1}^{n}X_{i},\mathbb{E}(X)\big)\rightarrow\sup_{u}\{\|\zeta(u)\|\ |\ \|u\|\leq 1\}$ in distribution, where for is a centered Gaussian random field with covariance given by: .
The difficulty with this central limit theorem is that it lacks a clear geometrical interpretation (in contrast to the classical central limit theorem for random variables) for the limiting distribution, and the question of whether such a geometrical interpretation exists remains open molchanov2006 . However, one advantage of this formulation is that it lends itself to a generalization of the Delta method bickel2006 from statistics.
Proposition 3 (Approximate Delta Method)
Suppose $r_{n}\,d\!l_{\infty}(C_{n},C)\rightarrow w$ in distribution, where $r_{n}$ is a strictly increasing sequence, $C_{n}$ is a sequence of random sets, $C$ is a deterministic set, and $w$ is a random variable. If $S$ is a Lipschitz continuous set-valued set function, then
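$$\limsup_{n}\ \mathbb{P}\big(r_{n}\,d\!l_{\infty}(S(C_{n}),S(C))\geq u\big)\leq\mathbb{P}(\kappa\cdot w\geq u),$$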
where $\kappa$ is the Lipschitz constant of $S$.
Proof
Lipschitz continuity of $S$ gives $r_{n}\,d\!l_{\infty}(S(C_{n}),S(C))\leq\kappa\cdot r_{n}\,d\!l_{\infty}(C_{n},C)$. Thus $\mathbb{P}\big(r_{n}\,d\!l_{\infty}(S(C_{n}),S(C))\geq u\big)\leq\mathbb{P}\big(\kappa\cdot r_{n}\,d\!l_{\infty}(C_{n},C)\geq u\big)$. The limit superior of both sides gives the result since $r_{n}\,d\!l_{\infty}(C_{n},C)\rightarrow w$ in distribution. ∎
Remark 5
The Delta method relates asymptotic distributions of random variables under differentiable functions bickel2006 , and the intuition is the derivative is used as a local approximation of the function. The above result demonstrates one instance where Lipschitz continuity can be used as a local approximation for set-valued mappings.
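The Lipschitz inequality that drives the approximate Delta method can be checked numerically in a simple case. The following sketch assumes a hypothetical set-valued set function `S` that maps a finite set in the plane to its image under a fixed matrix `A`, and uses the Frobenius norm of `A` as a (possibly loose) Lipschitz constant, since it upper-bounds the operator norm; the sets `C`, `C_n` and the rate `r_n = sqrt(n)` are likewise chosen only for illustration.

```python
import math

def hausdorff(C, D):
    """Pompeiu-Hausdorff distance between finite, nonempty sets in R^2."""
    def dist(x, S):
        return min(math.dist(x, s) for s in S)
    return max(max(dist(c, D) for c in C), max(dist(d, C) for d in D))

A = [[2.0, 1.0], [0.0, 3.0]]   # hypothetical deterministic matrix
# The Frobenius norm bounds the operator norm, so it is a valid Lipschitz constant.
kappa = math.sqrt(sum(a * a for row in A for a in row))

def S(C):
    """Set-valued set function C -> A C (image of the finite set C under A)."""
    return [(A[0][0] * x + A[0][1] * y, A[1][0] * x + A[1][1] * y) for x, y in C]

C = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

def scaled_distances(n):
    """Return (r_n * dist(S(C_n), S(C)), kappa * r_n * dist(C_n, C)) for a shifted C_n."""
    r_n = math.sqrt(n)
    C_n = [(x + 1.0 / n, y) for x, y in C]
    return r_n * hausdorff(S(C_n), S(C)), kappa * r_n * hausdorff(C_n, C)

for n in (1, 10, 100):
    print(n, scaled_distances(n))
```

In each printed pair the first entry is bounded by the second, which is the deterministic inequality that the proof converts into a bound on tail probabilities.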
Though the Minkowski difference does not generally invert the Minkowski sum, there is one special case when inversion is possible. If $C, D$ are compact convex sets, then $(C \oplus D) \ominus D = C$ schneider1993 . Using this property, we describe a new central limit theorem for random sets with a particular structure that is useful for statistical applications. Specifically, this result applies to randomly translated sets (RaTS), which are random sets of the form , where is a deterministic compact convex set, and is a (vector-valued) random variable.
Theorem 3.4 (Central Limit Theorem for RaTS)
Suppose the probability space is nonatomic, and that , , are i.i.d. random sets with , where is a deterministic compact convex set and , , are i.i.d. (vector-valued) random variables with zero mean and finite variance. Then
[TABLE]
in distribution, where is a jointly Gaussian random variable with zero mean and covariance matrix given by .
Proof
Since , the result follows by the classical central limit theorem bickel2006 .∎
The benefit of this new formulation of the central limit theorem is that it has a clear geometrical interpretation like the classical central limit theorem for random variables, but unfortunately this result only applies to the specific class of RaTS.
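The reduction of the RaTS central limit theorem to the classical one can be traced on intervals of the real line. This is a minimal sketch under stated assumptions: sets are closed intervals stored as `(lo, hi)`, the Minkowski difference is taken to be the set of all x with x + J contained in I (a common convention, assumed here), and the names `K`, `xis`, and `normalized_rats` are hypothetical.

```python
def minkowski_sum(I, J):
    """Minkowski sum of two closed intervals."""
    return (I[0] + J[0], I[1] + J[1])

def scale(t, I):
    """Scale a closed interval by a nonnegative number t."""
    return (t * I[0], t * I[1])

def minkowski_difference(I, J):
    """All x with x + J inside I, or None when no such x exists."""
    lo, hi = I[0] - J[0], I[1] - J[1]
    return (lo, hi) if lo <= hi else None

def normalized_rats(K, xis):
    """Minkowski average of the translates K + xi, followed by subtracting K."""
    total = (0.0, 0.0)
    for xi in xis:
        total = minkowski_sum(total, (K[0] + xi, K[1] + xi))
    average = scale(1.0 / len(xis), total)
    return minkowski_difference(average, K)

K = (-1.0, 1.0)                 # deterministic compact convex set
xis = [0.5, -1.5, 2.0, 1.0]     # sample translations
print(normalized_rats(K, xis))  # degenerate interval at the sample mean of xis
```

After subtracting K, only the sample mean of the translations remains, so scaling by the square root of n puts the problem in the setting of the classical central limit theorem for vectors.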
4 Kernel Regression
We will construct a nonparametric estimator for set-valued functions using an approach that can be viewed as a natural generalization of kernel regression methods for functions aswani2011 ; aswani2013 ; bickel1982 ; noda1976 ; wand1994 . These techniques are considered nonparametric because, in contrast to parametric models with a finite number of parameters, the number of parameters in nonparametric models increases as the amount of data increases.
4.1 Problem Setup
Consider a Lipschitz continuous set-valued function with random samples for , where: is a convex compact set; is a convex compact set for each ; are i.i.d. (vector-valued) random variables with a Lipschitz continuous density function that has the property for ; and with i.i.d. (vector-valued) random variables that have zero mean and finite variance . The problem is to estimate at any using the above described samples. We need convexity of to ensure its tangent cone is derivable at rockafellar2009 ; however, our results hold for all without requiring any further regularity assumptions.
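To make the sampling model concrete, the sketch below simulates interval-valued data on the real line and forms a locally weighted Minkowski average. Everything in it is an assumption made for illustration: the function `F`, the uniform design on [0, 1], the Gaussian translation noise, the box (indicator) kernel, and the Nadaraya-Watson-style form of `kernel_estimate`, which is one plausible way to generalize kernel regression to sets and is not necessarily the estimator analyzed in this paper.

```python
import random

def F(x):
    """Hypothetical interval-valued function F(x) = [x^2, x^2 + 1]."""
    return (x * x, x * x + 1.0)

def sample(n, noise_sd=0.1, seed=0):
    """Draw (x_i, y_i) with x_i uniform on [0, 1] and y_i = F(x_i) translated by noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(0.0, 1.0)
        w = rng.gauss(0.0, noise_sd)   # zero-mean translation noise
        lo, hi = F(x)
        data.append((x, (lo + w, hi + w)))
    return data

def kernel_estimate(data, x0, h):
    """Box-kernel weighted Minkowski average of the observed intervals near x0."""
    near = [y for x, y in data if abs(x - x0) <= h]
    if not near:
        return None  # no samples fall inside the kernel window
    lo = sum(y[0] for y in near) / len(near)
    hi = sum(y[1] for y in near) / len(near)
    return (lo, hi)

data = sample(2000)
print(kernel_estimate(data, x0=0.5, h=0.1))  # compare with F(0.5)
```

For intervals, a Minkowski average with nonnegative weights reduces to averaging the endpoints, which is why the estimate can be computed endpoint by endpoint.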
4.2 Kernel Functions
Kernel regression is so named because these approaches use kernel functions , which are functions that are non-negative, bounded, even (i.e., ), and have finite support (i.e., there is a constant such that when , and for ). One example of a kernel function is the indicator function , and another example is the Epanechnikov kernel . Notationally, it is useful to define the family of kernel functions and the function , where is the tangent cone of at the point . (Note is strictly greater than zero and finite because of the assumptions.) We first prove a lemma about :
Lemma 2
If , then for we have
** 2.
** 3.
**
Proof
We prove these three results by verifying that the hypothesis of Kolmogorov’s strong law of large numbers holds in each case, then applying this law of large numbers, and finally computing the expectation of the corresponding quantity in each case. To prove the first result, observe that
[TABLE]
where the first inequality holds for some constant because is bounded and nonzero with probability at most for some constant ; and the second inequality holds because . The finiteness of the above summation means we can apply Kolmogorov’s strong law of large numbers, which gives . Our next step is to compute this expectation. Note that
[TABLE]
where in the last line we made the change of variables . Let and . So we have
[TABLE]
where is the Lipschitz constant of the density , and is a constant that exists by continuity of . Next note as by Proposition 6.2 and Theorem 4.10 of rockafellar2009 . Thus taking the limit of (14) gives . This proves the first result when combined with the implication of Kolmogorov’s strong law of large numbers in our setting, and after noting since for .
For the proof of the second result, let denote the -th component of the vector . Next observe that
[TABLE]
where the first inequality holds for some constant because the have zero mean and because is bounded and nonzero with probability at most for some constant ; and the second inequality holds because . The finiteness of the above summation means Kolmogorov’s strong law of large numbers gives . But the are zero mean, and so we have that .
To prove the third result, observe that
[TABLE]
where the first inequality holds for some constant because is a compact set and because is bounded and nonzero with probability at most for some constant ; and the second inequality holds because . The finiteness of the above summation means we can apply Kolmogorov’s strong law of large numbers, which gives . Our next step is to compute this expectation. Note that
[TABLE]
where the second line makes the change of variables , and the third line holds for some constant because the kernel has finite support and the density is continuous. The above expectation is non-negative, and so . This proves the third result when combined with the outcome of Kolmogorov’s strong law of large numbers. ∎
4.3 Kernel Regression Estimator
We define a kernel regression estimate of at the point to be
[TABLE]
The following theorem proves the strong pointwise consistency of this estimator.
Theorem 4.1
If , then for .
Proof
Let be the Lipschitz constant of , and note that by Lipschitz continuity we have
[TABLE]
Corollary 23 and Lemma 21 give , and Corollary 21 and Lemma 22 yield . Corollary 23 and Lemma 23 give , and so Corollary 21 implies $\operatorname*{as-lim}_{n}\frac{1}{n}\bigoplus_{i=1}^{n}\varphi_{h}(X_{i}-u)\cdot\big(S(u)\oplus 2\kappa\|X_{i}-u\|\mathbb{B}\oplus w_{i}\big)=\gamma(u)\cdot f_{X}(u)\cdot S(u)$. So applying the sandwich lemma to (19) yields $\operatorname*{as-lim}_{n}\frac{1}{n}\bigoplus_{i=1}^{n}\varphi_{h}(X_{i}-u)\cdot\big(S_{i}\oplus\kappa\|X_{i}-u\|\mathbb{B}\big)=\gamma(u)\cdot f_{X}(u)\cdot S(u)$. Corollary 3 gives . Finally, Corollary 24 and Lemma 21 imply that . ∎
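To make the estimator concrete, here is a minimal sketch under stated assumptions. It assumes the estimate has the usual Nadaraya-Watson form, a kernel-weighted Minkowski sum of the observed sets divided by the sum of the kernel weights, and evaluates it through support functions, which are additive under Minkowski sums and nonnegative scaling. The scalar Epanechnikov kernel, the toy set-valued function `true_set`, the noise distribution, and all function names are illustrative choices, not taken from the paper.

```python
import random

def epanechnikov(t):
    """Scalar Epanechnikov kernel: nonnegative, even, supported on [-1, 1]."""
    return 0.75 * (1.0 - t * t) if abs(t) <= 1.0 else 0.0

def support(vertices, d):
    """Support function of a polygon given by its vertices."""
    return max(v[0] * d[0] + v[1] * d[1] for v in vertices)

def kernel_set_estimate_support(xs, sets, u, h, d):
    """Support function, in direction d, of the kernel-weighted Minkowski average at u."""
    weights = [epanechnikov((x - u) / h) for x in xs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no samples fall inside the kernel window")
    return sum(w * support(s, d) for w, s in zip(weights, sets)) / total

SQUARE = [(-1.0, -1.0), (1.0, -1.0), (1.0, 1.0), (-1.0, 1.0)]

def true_set(x):
    """Hypothetical Lipschitz set-valued function: the square scaled by (1 + x)."""
    return [((1.0 + x) * a, (1.0 + x) * b) for a, b in SQUARE]

random.seed(1)
xs, sets = [], []
for _ in range(4000):
    x = random.random()                                        # X_i uniform on [0, 1]
    w = (random.uniform(-0.2, 0.2), random.uniform(-0.2, 0.2)) # zero-mean noise w_i
    xs.append(x)
    sets.append([(a + w[0], b + w[1]) for a, b in true_set(x)])  # S_i = S(X_i) + w_i

# The true support value of S(0.5) in direction (1, 0) is 1.5.
est = kernel_set_estimate_support(xs, sets, u=0.5, h=0.1, d=(1.0, 0.0))
print(est)
```

Evaluating the support function over a grid of directions recovers an outer description of the estimated convex set; with vertex or zonotope data the same estimate could instead be formed directly from the Minkowski sums.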
4.4 Algorithms to Compute Kernel Regression Estimator
The statistical consistency of our kernel regression estimator is a theoretical result, and numerical computation of this estimator using the measured data for needs some discussion. The key point is that the corresponding algorithm used to compute the estimator depends on the representation of the sets . Since the random sets are RaTS, we only need to consider different representations of convex sets. Moreover, we focus our discussion on polytope representations since any compact convex set can be approximated arbitrarily well by polytopes schneider1993 .
If the sets are each represented by polynomial time membership oracles, then
[TABLE]
and so membership in the Minkowski sum can be determined in polynomial time. Polynomial time membership oracles exist for in a known compact set , with a self-concordant barrier function for and the functions defining nesterov1994 : The measurement of would consist of the function parameters defining , and set membership is determined by using interior point to solve a feasibility problem. Examples include polytopes , with measured data ; second-order cone sets , with measured data ; and combinations thereof. Other examples can be found in nesterov1994 .
Next suppose the sets are each represented by the zonotopes , where are weights and are vectors, which are polytopes defined as the Minkowski sum of vectors. Restated, the observations are the and . Then
[TABLE]
and so the Minkowski sum is polynomial time computable for this representation.
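As an illustration of why this representation is convenient, the sketch below assumes a common center-plus-generators convention in which each zonotope is the set of points center + sum_j a_j g_j with coefficients a_j in [-1, 1]. This convention and the function names are assumptions and may differ from the paper's parameterization. Under it, a weighted Minkowski sum only requires scaling and concatenating generators.

```python
def scale_zonotope(zonotope, weight):
    """Scale a zonotope (center, generators) by a nonnegative weight."""
    center, generators = zonotope
    return ([weight * c for c in center],
            [[weight * g for g in gen] for gen in generators])

def minkowski_sum_zonotopes(zonotopes):
    """Minkowski sum: add the centers and concatenate the generator lists."""
    dim = len(zonotopes[0][0])
    center = [sum(z[0][k] for z in zonotopes) for k in range(dim)]
    generators = [list(gen) for z in zonotopes for gen in z[1]]
    return (center, generators)

def zonotope_support(zonotope, direction):
    """Support function: <center, d> + sum_j |<g_j, d>| for coefficients in [-1, 1]."""
    center, generators = zonotope
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    return dot(center, direction) + sum(abs(dot(g, direction)) for g in generators)

Z1 = ([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])   # hypothetical observation 1
Z2 = ([1.0, 2.0], [[1.0, 1.0]])               # hypothetical observation 2
weighted = [scale_zonotope(Z1, 0.5), scale_zonotope(Z2, 0.5)]
Z_sum = minkowski_sum_zonotopes(weighted)
print(Z_sum)
```

The cost is linear in the total number of generators, although the generator list of the result grows with the number of summands.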
Lastly, suppose the sets are represented by the convex hull of a finite set of vertices, meaning that . In this setting the measurements are the vertices of each set , and the Minkowski sum is given by
[TABLE]
This is a polynomial time computation since the number of vertices is finite.
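For planar sets, the vertex-based computation can be sketched as follows. This is an illustrative implementation rather than the paper's code: it forms all pairwise vertex sums and prunes them with Andrew's monotone chain convex hull, and the helper names are hypothetical.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors (a - o) and (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def scale(vertices, weight):
    """Scale a polygon by a nonnegative weight."""
    return [(weight * x, weight * y) for x, y in vertices]

def minkowski_sum(p, q):
    """conv(P) + conv(Q) equals the convex hull of all pairwise vertex sums."""
    return convex_hull([(a[0] + b[0], a[1] + b[1]) for a in p for b in q])

unit_square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
total = minkowski_sum(unit_square, unit_square)
half = minkowski_sum(scale(unit_square, 0.5), scale(unit_square, 0.5))
print(total, half)
```

Taking the hull after each pairwise sum discards interior and collinear points, which keeps the intermediate vertex lists short when many sets are summed in sequence.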
4.5 Numerical Example
We conclude our discussion on kernel regression of set-valued functions with a numerical example to visually demonstrate the estimation problem being solved by our estimator. Consider the set-valued function in the bottom-left of Fig. 1, given by
[TABLE]
The variables have a distribution, and each measurement is in a vertex representation. The noise has a distribution, meaning its variance is . The top row of Fig. 1 shows measurements for , , and data points, respectively; and the bottom row shows estimates computed by (18) and (22) with an Epanechnikov kernel. (Footnote 1: Our code http://ieor.berkeley.edu/~aaswani/code/ssvf.zip runs in a few seconds.) This example shows that as the amount of data increases, the estimates converge pointwise to the actual set-valued function .
5 Inverse Approximate Optimization
Inverse optimization involves computing parameters that make measured solutions optimal ahuja2001 ; aswani2015 ; bertsimas2015 ; chan2014 ; esfahani2015 ; keshavarz2011 . In contrast, the inverse approximate optimization problem makes noisy measurements of suboptimal solutions, and the goal is to estimate both the amount of suboptimality and the parameters of the optimization problem generating the data. In principle, the VIA bertsimas2015 and KKT keshavarz2011 estimators can provide estimates of the desired quantities, but we show their estimates are statistically inconsistent. As a result, we construct an estimator for inverse approximate optimization, prove its statistical consistency, and then discuss some possible generalizations.
5.1 Problem Setup
Consider a parametric convex optimization problem
[TABLE]
in which are continuous functions that are convex in for each fixed value of and , and assume that for all the constraint qualification there exists such that holds. (Note this constraint representation is fully general since we can write .) We use the definition that -optimal solutions are those in the set
[TABLE]
Our results also apply when -optimal solutions are defined as in (25) but with . The difference is (25) does not allow any constraint violation, while the alternative definition allows constraint violation. Note there are other notions of -optimal solutions like distance to the KKT graph, but we do not consider these.
Now suppose -optimal solutions of (24) generate random samples for , where: are i.i.d. (vector-valued) random variables distributed on the set ; , where are i.i.d. (vector-valued) random variables distributed on with constants and ; and are i.i.d. (vector-valued) random variables with zero mean and distributed on a known convex set with finite support (which implies finite variance). We also assume the densities of are strictly positive on the interior of their supports (i.e., for and for ).
The inverse approximate optimization problem is to estimate using the for . Note that we assume the functional forms of are fixed. Let be a known closed set such that , and let be a known compact set such that ; the intuition is that these sets represent prior knowledge that constrain the parameters and amount of solution suboptimality. The choice corresponds to a situation with no such prior knowledge on , and the compactness assumption on is not restrictive in practice because this set can be made arbitrarily large. (Unbounded can also be used when a compactification with certain technical properties exists bahadur1971some .) A so-called identifiability condition bickel2006 is also needed. We assume that if then for all and . An identifiability condition (such as the one we have assumed) intuitively says that different parameters of the model produce different outputs.
5.2 Inconsistency of Existing Estimators
The VIA bertsimas2015 (which minimizes the first order suboptimality of the data) and KKT keshavarz2011 (which minimizes the KKT suboptimality of the data) estimators are statistically inconsistent for aswani2015 , but since these approaches minimize the amount of suboptimality of the measured data it is initially unclear without further analysis whether these approaches are inconsistent for problem instances with . The following result provides qualitative insights into the behavior of these estimators.
Proposition 4
Let be a constant, and suppose , , , , and are uniformly distributed. Then estimates generated by the VIA bertsimas2015 and KKT keshavarz2011 methods are such that .
Proof
The KKT estimate is given by
[TABLE]
where these are the minimizers of the below optimization problem
[TABLE]
Since under the hypothesis of this proposition, it holds the are i.i.d. and have triangular distribution with lower limit , upper limit , and mode [math]. Hence the density of is given by
[TABLE]
where is the Dirac delta function. So , and the strong law of large numbers implies .
The VIA estimate is given by where are the minimizers to
[TABLE]
However, observe that for simplifies to the constraint . Since the above optimization is minimizing each , this means the constraint will be at optimality. Recall that as shown in the proof for KKT, the have a triangular distribution with lower limit , upper limit , and mode [math]. This means . Applying the strong law of large numbers gives . ∎
This proposition shows that existing approaches cannot distinguish between noise in the measurements and suboptimality of the solutions. The reason is that these approaches minimize an incorrect error metric: they minimize the amount of suboptimality of the measured data, which is the wrong metric when the measured data is noisy because the noise increases the suboptimality of the measured data. Moreover, this indistinguishability is unbounded in the sense that as the noise variance increases, their estimates of suboptimality increase without bound. Such behavior is undesirable, and in fact the above result gives the following corollary on the statistical properties of VIA and KKT.
Corollary 4
The VIA bertsimas2015 and KKT keshavarz2011 estimators are statistically inconsistent.
Proof
By definition an estimator is consistent for a class of models if and only if it is consistent for each model in that class. Thus to show inconsistency of VIA and KKT it suffices to show inconsistency for a single model. The above proposition establishes inconsistency of VIA and KKT for a particular model because while , meaning these approaches are inconsistent when . ∎
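The mechanism behind this inconsistency, namely that noise inflates the suboptimality of the measured data, can be seen in a small simulation. The sketch below is not the VIA or KKT estimator and does not reproduce the setting of the proposition. It uses a hypothetical one-dimensional problem (minimize x^2, whose exact solution is 0) with uniform measurement noise, and reports the average suboptimality of the noisy measurements.

```python
import random

def objective(x):
    """Hypothetical objective; its unique minimizer is x = 0 with optimal value 0."""
    return x * x

def mean_measured_suboptimality(n, noise_halfwidth, rng):
    """Average of f(y_i) - f(x*) when y_i = x* + w_i and the true solution is exactly optimal."""
    x_star = 0.0
    total = 0.0
    for _ in range(n):
        y = x_star + rng.uniform(-noise_halfwidth, noise_halfwidth)
        total += objective(y) - objective(x_star)
    return total / n

rng = random.Random(2)
small_noise = mean_measured_suboptimality(20000, 1.0, rng)
large_noise = mean_measured_suboptimality(20000, 3.0, rng)
print(small_noise, large_noise)
```

Even though the underlying solutions are exactly optimal, the average measured suboptimality converges to the noise variance (a^2 / 3 for noise uniform on [-a, a]) rather than to zero, and it grows as the noise grows.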
5.3 Approximate Bilevel Programming (ABP) Estimator
To correct the indistinguishability (between suboptimality of solutions and noise in measurements) problem faced by existing approaches, we instead propose an estimator that explicitly models the measured data as consisting of a suboptimal solution added to noise. More specifically, we propose the following statistical estimator
[TABLE]
where and is the squared distance function defined in the preliminaries. It is also useful to consider estimators defined as approximate solutions to the above optimization problem. Let be a nonnegative value, and define the estimates
[TABLE]
For notational convenience, we will call this estimator the ABP estimator. Note these estimates are defined as being any of the optimization problem (30).
Theorem 5.1
The ABP estimator is strongly statistically consistent, meaning we have whenever and .
Proof
Our first step is to show satisfies certain continuity properties. Note is continuous by Example 5.10 of rockafellar2009 , and so is continuous by the Berge maximum theorem berge1963 . Noting , we can apply the Berge maximum theorem berge1963 since this feasible set is osc by Example 5.8 of rockafellar2009 : This implies is lower semicontinuous in , and so is lower semicontinuous in by Fatou’s lemma.
Next note that the estimate also minimizes the optimization problem
[TABLE]
But by assumption, and so the objective of (32) is nondecreasing in . Hence by Proposition 7.4 of rockafellar2009 , where is the epi-limit rockafellar2009 . We next prove that
[TABLE]
and our approach is to use a well-known covering argument originally due to Wald wald1949note . Let be a decreasing sequence (i.e., ) of open neighborhoods of , with . Since is lower semicontinuous in , this means by the identifiability condition. Thus there exists such that for . By lower semicontinuity of and the monotone convergence theorem, there exists an open neighborhood for each so that we have . Since is compact, there exists a finite set such that for forms a finite subcover of . Combining the above with the Borel-Cantelli lemma implies for each , which by the finiteness of implies that . The desired (33) follows since we choose the sequence such that .
Next consider the optimization problem
[TABLE]
Note is feasible for both (32) and (34), and so the minimums of (32) and (34) are both less than or equal to . This means cannot minimize (34). Furthermore, using (33) implies that almost surely the (unique) minimizer of (34) is , and almost surely the minimum value of (34) is . But from the argument in the preceding paragraph, (32) epi-converges almost surely to (34) since are fixed. The result now follows from Theorem 7.33 of rockafellar2009 . ∎
The above result concerns almost sure convergence of the ABP estimates to the actual parameters , but a related question is whether the corresponding solution set estimates converge to the actual solution sets . Our semicontinuous mapping theorem can be used to establish almost sure convergence of the solution set estimates, and this argument leads to the following corollary.
Corollary 5
We have that for . If or is strictly convex in , then for .
Proof
The above proof established that is osc in . And so the first part of the corollary follows by the semicontinuous mapping theorem. If then is continuous at by Example 5.10 of rockafellar2009 . If is strictly convex in , then is single-valued rockafellar2009 . Hence is continuous because a single-valued, osc, and locally bounded function is continuous rockafellar2009 . Thus the second part of the corollary follows from the semicontinuous mapping theorem.∎
5.4 Algorithms to Compute ABP Estimator
We next discuss numerical computation of ABP using the data for . The ABP estimator is an approximate (i.e., the solution sets have possibly greater than zero) bilevel program. Bilevel programs are optimization problems in which some decision variables are constrained to be solutions of other optimization problems, called lower level problems. One approach to solving bilevel programs replaces the lower level problem with its KKT conditions allende2013 ; dempe2012 , and the result can sometimes be rewritten as a mixed-integer program that may be numerically solved quickly aswani2016_wl . Another approach upper bounds the objective function of the lower level problem by its value function outrata1990 ; ye1995 .
Here we describe how a third approach that upper bounds the objective function of the lower level problem by its dual function aswani2015 ; aswani2016 can be used to compute the ABP estimator. If is the Lagrangian dual function corresponding to (24), then under mild conditions ensuring zero duality gap the ABP estimator is given by
[TABLE]
This duality-based reformulation can be numerically solved by two different algorithms aswani2015 ; aswani2016 , which we briefly describe here. More details can be found in the corresponding references, and both algorithms assume the sets are compact.
Since the reformulation (35) is a convex optimization problem for fixed , one algorithm aswani2015 for computing ABP is to: discretize the set into a finite set such that it forms a set covering with balls of a prescribed radius, compute the minimum objective function value of (35) for (which we call ), and then choose estimates . A result from aswani2015 implies that estimates chosen using this enumeration algorithm satisfy the assumptions of Theorem 5.1, which is sufficient for statistical consistency.
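The outer loop of this enumeration scheme can be sketched as follows. The sketch assumes the discretized set is a box and uses cell centers of a uniform grid as the covering. The function `inner_value` is a hypothetical stand-in for the minimum objective value of the convex problem at a fixed grid point, which in practice would come from a convex solver; here it is replaced by a toy quadratic so the example is self-contained.

```python
import itertools
import math

def grid_cover(bounds, radius):
    """Cell centers of a uniform grid such that every point of the box lies
    within Euclidean distance `radius` of some grid point."""
    dim = len(bounds)
    step = 2.0 * radius / math.sqrt(dim)       # maximum allowed cell width
    axes = []
    for lo, hi in bounds:
        count = max(1, math.ceil((hi - lo) / step))
        width = (hi - lo) / count
        axes.append([lo + (k + 0.5) * width for k in range(count)])
    return list(itertools.product(*axes))

def enumerate_minimize(inner_value, grid):
    """Evaluate the inner problem at every grid point and keep the best one."""
    return min(grid, key=inner_value)

def inner_value(point):
    """Toy placeholder for the optimal value of the convex subproblem at `point`."""
    return (point[0] - 0.3) ** 2 + (point[1] - 0.7) ** 2

grid = grid_cover([(0.0, 1.0), (0.0, 1.0)], radius=0.05)
best = enumerate_minimize(inner_value, grid)
print(len(grid), best)
```

The number of grid points grows exponentially with the dimension of the discretized set, so this approach is practical only when that dimension is small.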
A second algorithm aswani2016 replaces the Lagrangian dual by a numerically computed dual. Partial dualization is used to define a regularized dual function (RDF)
[TABLE]
Here, is any compact set defined such that . The intuition is that is a set that contains all the feasible sets of (24) within its interior. When does not depend on , we can choose with and that are computed by solving convex optimization problems. Many applications of inverse approximate optimization consist of such a setting where the feasible set is independent of the inputs or the parameters . The benefit of the RDF is it can be numerically computed because it is a convex optimization problem, and that its gradient
[TABLE]
always exists when . In contrast, the Lagrangian dual is usually only directionally differentiable but not differentiable. The algorithm proceeds by using a nonlinear numerical solver to solve a sequence of optimization problems in which goes to [math].
A third possibility is a polynomial time approximation algorithm with the property that statistical consistency holds as the number of samples increases to infinity. Such an algorithm has been constructed, when is affine in and does not depend on , for inverse optimization with noisy data aswani2015 ; it uses kernel regression to pre-smooth the data and then solves a convex problem corresponding to inverse optimization assuming no noise in the pre-smoothed data. Here we sketch a similar algorithm for inverse approximate optimization, and we leave its analysis for future work. Define for , and choose the data by sampling from the uniform distribution on . The estimate is computed by solving (35) with the change that the constraints are removed.
5.5 Numerical Example
We next consider a numerical example to visually compare estimates of produced by our ABP estimator and the VIA bertsimas2015 and KKT keshavarz2011 estimators. Suppose , , , , , , has a uniform distribution , is uniformly distributed on , , and . The solution set in this setting is shown in the left column of Fig. 2. Each measurement for this example is a point, and the top row of Fig. 2 shows the measurements for , , and data points, respectively. The rows below show the estimated (using the measurements shown above) solution set as computed by ABP, KKT, and VIA, respectively. (Footnote 2: Our code http://ieor.berkeley.edu/~aaswani/code/ssvf.zip runs in about three hours.) This example shows that as the number of measurements increases, the solution set estimated by ABP (KKT and VIA) converges (does not converge) to the actual solution set. This statistical behavior is expected given our theoretical results on the strong consistency of ABP and the statistical inconsistency of KKT and VIA.
5.6 Related Inverse Optimization Problems
In our problem setup, the measurement noise had a distribution with a finite support. However, noise models commonly used in statistics include distributions with unbounded support but finite variance. The canonical example is that are jointly Gaussian with zero mean and finite covariance. A heuristic approach for distributions with unbounded support is to use our ABP estimator with the choices of for sub-Gaussian distributions (i.e., distributions bounded from above by a jointly Gaussian random variable) and for sub-exponential distributions (i.e., distributions with exponentially decaying tails), where is the covariance matrix of . The reason for this suggested heuristic is these choices of are analogous to bounds on the maximum expected values of sub-Gaussian and sub-exponential random variables boucheron2013 .
Since the ABP estimator is a heuristic in this setting, an obvious topic is to design a statistically consistent estimator for inverse approximate optimization problems with unbounded noise. Maximum likelihood estimation is arguably the most natural approach because otherwise it is difficult to distinguish between noise and suboptimality of solutions. Specifically, consider the original problem setup but with the changes that the random sample is , the are uniformly distributed within , and that is distributed according to some known density . Then the maximum likelihood estimator (MLE) for this modified problem setup is given by
[TABLE]
This optimization problem has a challenging structure in which the domains of integration depend upon the decision variables royset2017variational , and presents an opportunity for the further study of designing numerical algorithms to solve such optimization problems. We do note that for fixed , the integrals in the objective can be numerically computed in polynomial time using hit-and-run techniques for sampling from convex sets lovasz2006 ; smith1984 . And so the enumeration algorithm we described earlier for the ABP estimator could be easily modified to solve this MLE problem.
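For reference, a basic hit-and-run sampler is sketched below. It assumes the convex set is a bounded polytope given in the hypothetical form {x : A x <= b} together with an interior starting point. Each step draws a uniformly random direction, intersects the corresponding line with the polytope, and moves to a uniformly chosen point on that chord. This is a generic textbook version, not the specific procedure analyzed in the cited references.

```python
import math
import random

def hit_and_run(A, b, x0, steps, rng):
    """Generate `steps` hit-and-run iterates inside the bounded polytope {x : A x <= b}."""
    x = list(x0)
    dim = len(x)
    samples = []
    for _ in range(steps):
        # Uniform random direction on the unit sphere.
        d = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in d))
        d = [c / norm for c in d]
        # Find the chord {x + t d : t_lo <= t <= t_hi} inside the polytope.
        t_lo, t_hi = -math.inf, math.inf
        for row, bi in zip(A, b):
            ad = sum(r * c for r, c in zip(row, d))
            slack = bi - sum(r * c for r, c in zip(row, x))
            if ad > 1e-12:
                t_hi = min(t_hi, slack / ad)
            elif ad < -1e-12:
                t_lo = max(t_lo, slack / ad)
        # Move to a uniformly distributed point on the chord.
        t = rng.uniform(t_lo, t_hi)
        x = [xi + t * di for xi, di in zip(x, d)]
        samples.append(tuple(x))
    return samples

# Unit square written as A x <= b.
A = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b = [1.0, 0.0, 1.0, 0.0]
samples = hit_and_run(A, b, (0.5, 0.5), steps=2000, rng=random.Random(3))
print(samples[-1])
```

Averaging an integrand over such samples and multiplying by the volume of the set gives a Monte Carlo estimate of the integral. Successive iterates are correlated, so early samples are usually discarded.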
Remark 6
The ABP and MLE estimators are actually qualitatively the same. The term in ABP and the term in MLE both penalize estimates in which the solutions are far from the solution sets , and the term in MLE and the term in ABP both penalize estimates that generate large solution sets.
In the two inverse approximate optimization problem setups considered above, we assumed the approximate solutions were drawn from the solution sets according to some distribution. However, another modified problem setup would be to assume the were chosen from the solution sets by solution of another optimization problem. This kind of setup corresponds to a scenario in which the are solutions to an optimistic bilevel optimization problem with unique solutions:
[TABLE]
In this case, the estimation procedure can be posed as a least squares problem
[TABLE]
This is a challenging multi-level optimization problem and presents an opportunity for the further study of designing numerical algorithms to solve such optimization problems. We do note that for fixed , this becomes a convex optimization problem. And so the enumeration algorithm we described earlier for the ABP estimator could be easily modified to solve this least squares problem.
6 Conclusion
In this paper, we used variational analysis to develop tools for statistics with set-valued functions, and then applied these tools to two estimation problems. We constructed and studied a kernel regression estimator for set-valued functions and an estimator for the inverse approximate optimization problem. The area of statistics with set-valued functions remains largely unexplored with many remaining problems. One question is the design of numerical representations of sets and set-valued functions. Though constraint representations of sets are pervasive, numerical machinery like epi-splines royset2016 may offer greater representational flexibility. Another question is the development of numerical algorithms to solve optimization problems that arise in statistical estimation for set-valued functions. Related inverse optimization problems lead to formulations (38) and (40) with structures that are not well-studied from the perspective of numerical optimization. Further study of statistics with set-valued functions will require developing new numerical methods and optimization theory.
References
- (1) Ahuja, R., Orlin, J.: Inverse optimization. Operations Research 49(5), 771–783 (2001)
- (2) Allende, G., Still, G.: Solving bilevel programs with the KKT-approach. Mathematical Programming 138(1-2), 309 (2013)
- (3) Artstein, Z., Vitale, R.: A strong law of large numbers for random compact sets. The Annals of Probability pp. 879–882 (1975)
- (4) Aswani, A., Bickel, P., Tomlin, C.: Regression on manifolds: Estimation of the exterior derivative. The Annals of Statistics pp. 48–81 (2011)
- (5) Aswani, A., Gonzalez, H., Sastry, S., Tomlin, C.: Provably safe and robust learning-based model predictive control. Automatica 49(5), 1216–1226 (2013)
- (6) Aswani, A., Kaminsky, P., Mintz, Y., Flowers, E., Fukuoka, Y.: Behavioral modeling in weight loss interventions. Available at SSRN: https://ssrn.com/abstract=2838443 (2016)
- (7) Aswani, A., Shen, Z.J., Siddiq, A.: Inverse optimization with noisy data. Operations Research (2017). Accepted
- (8) Aumann, R.J.: Integrals of set-valued functions. Journal of Mathematical Analysis and Applications 12(1), 1–12 (1965)
