Optimization Problems for Machine Learning: A Survey
Claudio Gambella, Bissan Ghaddar, Joe Naoum-Sawaya

TL;DR
This survey reviews how various machine learning methods can be formulated as optimization problems, highlighting recent advances, challenges, and future research directions in the field.
Contribution
It provides a comprehensive overview of optimization models across multiple machine learning approaches and discusses their strengths, limitations, and emerging applications.
Findings
Optimization models enhance machine learning techniques.
Numerical optimization advances benefit various ML applications.
Open problems and future research directions are identified.
Abstract
This paper surveys the machine learning literature and presents in an optimization framework several commonly used machine learning approaches. Particularly, mathematical optimization models are presented for regression, classification, clustering, deep learning, and adversarial learning, as well as new emerging applications in machine teaching, empirical model learning, and Bayesian network structure learning. Such models can benefit from the advancement of numerical optimization techniques which have already played a distinctive role in several machine learning settings. The strengths and the shortcomings of these models are discussed and potential research directions and open problems are highlighted.
| layers indices. | |
| number of units, or neurons, in layer . | |
| element-wise activation function. | |
| -th unit of layer . | |
| weight matrix for layer . | |
| bias vector for layer . | |
| training dataset, with observations and responses | |
| output vector of layer ( indicates input feature vector, indicates derived feature vector). |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Optimization Problems for Machine Learning: A Survey
Claudio Gambella 1, Bissan Ghaddar 2, Joe Naoum-Sawaya2
(1 IBM Research Ireland, Mulhuddart, Dublin 15, Ireland, 2 Ivey Business School, University of Western Ontario, London, Ontario N6G 0N1, Canada )
Abstract
This paper surveys the machine learning literature and presents in an optimization framework several commonly used machine learning approaches. Particularly, mathematical optimization models are presented for regression, classification, clustering, deep learning, and adversarial learning, as well as new emerging applications in machine teaching, empirical model learning, and Bayesian network structure learning. Such models can benefit from the advancement of numerical optimization techniques which have already played a distinctive role in several machine learning settings. The strengths and the shortcomings of these models are discussed and potential research directions and open problems are highlighted.
Contents
1 Introduction
The pursuit to create intelligent machines that can match and potentially rival humans in reasoning and making intelligent decisions goes back to at least the early days of the development of digital computing in the late 1950s [198]. The goal is to enable machines to perform cognitive functions by learning from past experiences and then solving complex problems under conditions that are varying from past observations. Fueled by the exponential growth in computing power and data collection coupled with the widespread of practical applications, machine learning is nowadays a field of strategic importance.
1.1 Machine Learning Basics
Broadly speaking, machine learning relies on learning a model that returns the correct output given a certain input. The inputs, i.e., predictor measurements, are typically values that represent the parameters that define a problem, while the output, i.e., response, is a value that represents the solution.
Machine learning models fall into two categories: supervised and unsupervised learning [99, 129]. In supervised learning, a response measurement is available for each observation of predictor measurements and the aim is to fit a model that accurately predicts the response of future observations. More specifically, in supervised learning, values of both the input and the corresponding output are available and the objective is to learn a function that approximates with a reasonable margin of error the relationship between the input and the corresponding output. The accuracy of a prediction is evaluated using a loss function which computes a distance measure between the predicted output and the actual output. In a general setting, the best predictive model is the one that minimizes the risk
[TABLE]
where is the probability of observing data point [210]. In practice is unknown, however the assumption is that an independent and identically distributed sample of data points forming the training dataset is given. Thus instead of minimizing the risk, the best predictive model is the one that minimizes the empirical risk such that
[TABLE]
When learning a model, a key aspect to consider is model complexity. Learning a highly complex model may lead to overfitting, which refers to having a model that fits the training data very well but generalizes poorly to other data. The minimizer of the empirical risk will often lead to overfitting, and hence has a limited generalization property. Furthermore, in practice the data may contain noisy and incorrect values, i.e., outliers, which impacts the value of the empirical risk and subsequently the accuracy of the learned model. Attempting to find a model that perfectly fits every data point in the dataset is thus not desired, since the predictive power of the model will be diminished when points that are far from typical are fitted. Usually, the choice of is restricted to a family of functions such that
[TABLE]
The degree of model complexity is generally dictated by the nature and size of the training data. While simpler models are advised for small training datasets that do not uniformly cover the possible data ranges, complex models need large data sets to avoid overfitting.
On the other hand, in unsupervised learning, response variables are not available and the goal of learning is to understand the underlying characteristics of the observations. Unsupervised learning thus attempts to learn from the distribution of the data the distinguishing features and the associations in the data. As such, the main use-case for unsupervised learning is exploratory data analysis, where the purpose is to segment and cluster the samples in order to extract insights. While with supervised learning there is a clear measure of accuracy by evaluating the prediction to the known response, in unsupervised it is difficult to evaluate the validity of the derived structure.
The fundamental theory of machine learning models and consequently their success can be largely attributed to research at the interface of computer science, statistics, and operations research. The relation between machine learning and operations research can be viewed along three dimensions: (a) machine learning applied to management science problems, (b) machine learning to solve optimization problems, (c) machine learning problems formulated as optimization problems.
1.2 Machine Learning and Operations Research
Leveraging data in business decision making is nowadays mainstream as any business in today’s economy is instrumented for data collection and analysis. While the aim of machine learning is to generate reliable predictions, management science problems deal with optimal decision making. Thus, methodological developments that can leverage data predictions for optimal decision making is an area of research that is critical for future business value [30, 146, 172].
Another area of research at the interface of machine learning and operations research is using machine learning to solve hard optimization problems and particularly -hard integer constrained optimization [41, 138, 139, 158, 214]. In that domain, machine learning models are introduced to complement existing approaches that exploit combinatorial optimization through structure detection, branching, and heuristics.
Lastly, the training of machine learning models can be naturally posed as an optimization problem with typical objectives that include optimizing training error, measure of fit, and cross-entropy [42, 43, 77, 221]. In fact, the widespread adoption of machine learning is in part attributed to the development of efficient solution approaches for these optimization problems, which enabled the training of machine learning models. As we review in this paper, the development of these optimization models has largely been concentrated in areas of computer science, statistics, and operations research. However, diverging publication outlets, standards, and terminology persist.
1.3 Aim and Scope
The aim of this paper is to present machine learning as optimization problems. For that, in addition to publications in classical operations research journals, this paper surveys machine learning and artificial intelligence conferences and journals, such as the conference on Association for the Advancement of Artificial Intelligence and the International Conference on Machine Learning. Furthermore, since machine learning research has rapidly accelerated with many important papers still in the review process, this paper also surveys a considerable number of relevant papers that are available on the arXiv repository. This paper also complements the recent surveys of [43, 77, 221] which described methodological developments for solving machine learning optimization problems; [21, 158] which discussed how machine learning advanced the solution approaches of mathematical programming; [71, 178] which described the interactions between operations research and data mining; [25] which surveyed solution approaches to machine learning models cast as continuous optimization problems; and [199] which provided an overview on the various levels of interaction between optimization and machine learning.
This paper presents optimization models for regression, classification, clustering, and deep learning (including adversarial attacks), as well as new emerging paradigms such as machine teaching and empirical model learning. Additionally, this paper highlights the strengths and the shortcomings of the models from a mathematical optimization perspective and discusses potential novel research directions. This is to foster efforts in mathematical programming for machine learning. While important criteria for contributions in operations research are the convergence guarantees, deviation to optimality and speed increments with respect to benchmarks, machine learning applications have a partly different set of goals, such as scalability, reasonable execution time and memory requirement, robustness and numerical stability and, most importantly, generalization [25]. It is therefore common for mathematical programming approaches to sacrifice optimality (local or global) and convergence guarantees to obtain better generalization properties, by adopting strategies such as early stopping [184].
Following this introductory section, regression models are discussed in Section 2 while classification and clustering models are presented in Sections 3 and 4, respectively. Linear dimension reduction methods are reviewed in Section 5. Deep learning models are presented in Section 6, while models for adversarial learning are discussed in Section 7. New emerging paradigms that include machine teaching and empirical model learning are presented in Section 8. Finally, conclusions are drawn in Section 9.
2 Regression Models
2.1 Linear Regression
Since the early era of statistics, linear regression models have been widely adopted in supervised learning for predicting a quantitative response. The central assumption is that the relationship between the dependent variables (feature measurements, or predictors, or input vector) and the independent variable (real-valued output, or response) is representable with a linear function (regression function) with a reasonable accuracy. Linear regression models preserve considerable interest, given their simplicity, their extensive range of applications, and the ease of interpretability. In particular, machine learning interpretability, in its simplest form, is the ability to explain in a humanly understandable way the role of the inputs in the outcome [86].
Linear regression aims to find a linear function that expresses the relation between an input vector of dimension and a real-valued output such as
[TABLE]
where is the intercept of the regression line and is the vector of coefficients corresponding to each of the input variables. In order to estimate the regression parameters and , one needs a training set where denotes training inputs and denotes training outputs where each is associated with the real-valued output . The objective is to minimize the empirical risk (1), in order to quantify via the association between predictor and the response, for each .
The most commonly used loss function for regression is the least squared estimate, where fitting a regression model reduces to minimizing the residual sum of squares (RSS) between the labels and the predicted outputs, such as
[TABLE]
The least squares estimate is known to have the smallest variance among all linear unbiased estimates, and has a closed form solution. However, this choice is not always ideal for fitting, since it can yield a model with low prediction accuracy, due to a large variance, and often leads to a large number of non-zero regression coefficients (i.e., low interpretability). Shrinkage methods discussed in Section 2.2 and Linear Dimension Reduction discussed in Section 5 are alternatives to the least squared estimate. Forward or backward elimination are also commonly used approaches to perform variable selection and to avoid overfitting [99].
The process of gathering input data is often affected by noise, which can impact the accuracy of statistical learning methods. A model that takes into account the noise in the features of linear regression problems is presented in [27], which also investigates the relationship between regularization and robustness to noise. The noise is assumed to vary in an uncertainty set , and the learner adopts the robust perspective:
[TABLE]
where is a convex function that measures the residuals (e.g., a norm function). The characterization of the uncertainty set directly influences the complexity of problem (4).
The design of high-quality linear regression models requires several desirable properties, which are often conflicting and not simultaneously implementable. A fitting procedure based on Mixed Integer Quadratic Programming (MIQP) is presented in [31] and takes into account sparsity, joint inclusion of subset of features (called selective sparsity), robustness to noisy data, stability against outliers, modeler expertise, statistical significance, and low global multicollinearity. Mixed Integer Programming (MIP) models for regression and classification are also investigated in [33]. The regression problem is modeled as an assignment of data points to groups with the same regression coefficients.
In order to speed up the fitting procedure and improve the interpretability of the regression model, irrelevant variables can be excluded via feature selection strategies. For example, feature selection is desired in case some regression variables are highly correlated. Multicollinearity can be detected by the condition number of the correlation matrix or the variance influence factor (VIF) [61]. To achieve feature selection in this case, [202] introduces a mixed integer semidefinite programming formulation to eliminate multicollinearity by bounding the condition number. The approach requires to solve a single optimization problem, in contrast with the cutting plane algorithm of [31]. Alternatively, [203] proposes a mixed integer quadratic optimization formulation with an upper bound on VIF, which is a better-grounded statistical indicator for multicollinearity with respect to the condition number.
2.2 Shrinkage methods
Shrinkage methods (also called regularization methods) seek to diminish the value of the regression coefficients. The aim is to obtain a more interpretable model (with less relevant features), at the price of introducing some bias in the model determination. A well-known shrinkage method is Ridge regression, where a -norm penalization on the regression coefficients is added to the loss function such that
[TABLE]
where controls the magnitude of shrinkage.
Another technique for regularization in regression is the lasso regression, which penalizes the -norm of the regression coefficients, and seeks to minimize the quantity
[TABLE]
When is sufficiently large, the -norm penalty forces some of the coefficient estimates to be exactly equal to zero, hence the models produced by the lasso are more interpretable than those obtained via Ridge regression.
Ridge and lasso regression belong to a class of techniques to achieve sparse regression. As discussed in [32, 34], sparse regression can be formulated as the best subset selection problem [167]
[TABLE]
where is an upper bound on the number of predictors with a non-zero regression coefficient, i.e., the predictors to select, and is the number of non-zero entries of , which is commonly referred to as the “[math]-norm” (though is not technically a norm as it does not satisfy the homogeneity property) . Problem (7)–(9) is -hard, as proven in [174]. The recent work of [32] demonstrated that the best subset selection can be solved to near-optimal solutions using optimization techniques for values of in the hundreds or thousands. Specifically, by introducing the binary variables , the sparse regression problem can be transformed into the MIQP formulation
[TABLE]
where is a large constant, . Since the choice of the data dependent constant largely affects the strength of the MIQP formulation, alternative formulations based on Specially Ordered Sets Type I can be devised [70].
As discussed in [120], the prediction accuracy of best subset selection is however highly dependent on the noise present in the input dataset, and it is not possible to establish a dominance relationship over lasso regression and forward stepwise selection [91]. In order to limit the effect of noise in the input data, make the model more robust, and to avoid numerical issues, [34] introduces the Tikhonov regularization term with weight into the objective function of problem (10)–(14), which is then solved using a cutting plane approach.
The task of finding a linear model to express the relationship between regressors and output is a particular case of selecting the hyperplane that minimizes a measure of the deviation of the data with respect to the induced linear form. As presented in [37], locating a hyperplane to fit a set of points , consists of finding , where is a mapping to the residuals of the points on the hyperplane (according to a distance measure in ), and is an aggregation function on the residuals (e.g., residual sum of squares, least absolute deviation [89]). If the number of points is much smaller than the dimension of the space, feature selection strategies can be applied [32, 169]. We note that hyperplane fitting is a variant of facility location problems [84, 192].
2.3 Regression Models Beyond Linearity
A natural extension of linear regression models is to consider nonlinear terms, which may capture complex relationships between regressors and predictors. Nonlinear regression models include, among others, polynomial regression, exponential regression, step functions, regression splines, smoothing splines and local regression [99, 129]. Alternatively, the Generalized Additive Models (GAMs) [119] maintain the additivity of the original predictors and the relationship between each feature and the response is expressed using nonlinear functions such as
[TABLE]
GAMs may increase the flexibility and accuracy of the predictions with respect to linear models, while maintaining a certain level of interpretability of the predictors. However, one limitation is given by the assumption of additivity of the features. To further increase the model flexibility, one could include predictors of the form , or consider non-parametric models, such as random forests and boosting. It has been empirically observed that GAMs do not represent well problems where the number of observations is much larger than the number of predictors. In [204] the Generalized Additive Model Selection is introduced to fit sparse GAMs in high dimension with a penalized likelihood approach. The penalty term is derived from the fitting criterion for smoothing splines. Alternatively, [68] proposes to fit a constrained version of GAMs by solving a conic programming problem.
As an intermediate model between linear and nonlinear relationships, compact and simple representations via piecewise affine models have been discussed in [137]. Piecewise affine forms emerge as candidate models when the fitting function is known to be discontinuous [94], separable [81], or approximate to complex nonlinear expressions [80, 187, 213]. Fitting piecewise affine models involves partitioning the domain of the input data into subdomains and fitting for each subdomain an affine function , in order to minimize a measure of the overall fitting error. To facilitate the fitting procedure, the domain is partitioned a priori (see -hyperplane clustering in Section 4.3). Neglecting domain partitioning may lead to large fitting errors. In contrast, [5] considers both aspects in determining piecewise affine models for piecewise linearly separable subdomains via a mixed integer linear programming formulation and a tailored heuristic. Mixed integer models are also proposed in [207], however a partial knowledge of the subdomains is required. Alternatively, clustering techniques can be adopted for domain partitioning [94].
3 Classification
The task of classifying data is to decide the class membership of an unlabeled data item based on the training dataset where each has a known class membership . A recent comparison of machine learning techniques for binary classification is found in [18]. This section reviews the common binary and multiclass classification approaches that include logistic regression, linear discriminant analysis, decision trees, and support vector machines.
3.1 Logistic Regression
In most problem domains, there is no functional relationship between and . In this case, the relationship between and has to be described more generally by a probability distribution while assuming that the training contains independent samples from . In this section, the label is assumed to be binary, i.e., . The optimal class membership decision is to choose the class label that maximizes the posterior distribution . Logistic regression calculates the class membership probability for one of the two categories in the dataset as
[TABLE]
The decision boundary between the two binary classes is formed by a hyperplane whose equation is . Points at this decision boundary have . The parameters and are usually obtained by maximum-likelihood estimation [87]
[TABLE]
which is equivalent to
[TABLE]
Problem (16) is convex and differentiable and first order methods such as gradient descent as well as second order methods such as Newton’s method can be applied to find a global optimal solution.
To tune the logistic regression model and to avoid overfitting, variable selection can be performed where only the most relevant subsets of the variables are kept in the model [99]. Heuristic approaches such as forward selection or backward elimination can be applied to add or remove variables respectively, based on the statistical significance of each of the computed coefficients. Interaction terms can be also added to further complicate the model at the risk of overfitting the training data.
3.2 Linear Discriminant Analysis
Linear discriminant analysis (LDA) is an approach for classification and dimensionality reduction. It is often applied to data that contains a large number of features (such as image data) where reducing the number of features is necessary to obtain robust classification. While LDA and Principal Component Analysis (PCA) (see Section 5.1) share the commonality of dimensionality reduction, LDA tends to be more robust than PCA since it takes into account the data labels in computing the optimal projection matrix [19].
Given the dataset where each data sample belongs to one of classes such that if belongs to the -th class then is 1 where , the input data is partitioned into groups where denotes the sample set of the -th class which contains data points. LDA maps the features space to a lower dimensional space () through a linear transformation [216]. The class mean of the -th class is given by while the global mean in given by . In the projected space the class mean is given by while the global mean in given by .
The within-class scatter and the between-class scatter evaluate the class separability and are defined as and respectively such that
[TABLE]
The within-class scatter evaluates the spread of the data around the class mean while the between-class scatter evaluates the spread of the class means around the global mean. For the projected data, the within-class and the between-class scatters are defined as and respectively such that
[TABLE]
The LDA optimization problem is bi-objective where the within-class scatter should be minimized while the between-class scatter should be maximized. The optimal transformation can be obtained by maximizing the Fisher criterion (the ratio of between-class to within-class scatters)
[TABLE]
Note that since the between-class and the within-class scatters are not scalar, the determinant is used to obtain a scalar objective function. As discussed in [100], assuming that is invertible and non-singular, the Fisher criterion is optimized by selecting the largest eigenvalues of and the corresponding eigen vectors form the optimal transformation matrix . Instead of using Fisher criterion, bi-objective optimization techniques may also potentially be used to formulate and solve the LDA optimization problem exactly.
An alternative formulation of the LDA optimization problem is provided in [63] by maximizing the minimum distance between each class center and the total class center. The proposed approach known as the large margin linear discriminant analysis requires the solution of non-convex optimization problems. A solution approach is also proposed based on solving a series of convex quadratic optimization problems.
3.3 Decision Trees
Decision trees are classical models for making a decision or classification using splitting rules organized into a tree data structure. Tree-based methods are non-parametric models that partition the predictor space into sub-regions and then yield a prediction based on statistical indicators (e.g., median and mode) of the segmented training data. Decision trees can be used for both regression and classification problems.
For regression trees, the splitting of the training dataset into distinct and non-overlapping regions can be done using a top-down recursive binary splitting procedure. Starting from a root node that contains the full dataset, a cut that splits the data into distinct sets is identified. For the case of a univariate cut (i.e., involving only a single feature), the cutpoint for feature is the one that leads to the two splitted regions and that have the greatest possible reduction in the residual sum of squares where denotes the mean response for the training observations in region . A multivariate split is of the form , where is a vector. Another optimization criterion is the measure of purity [45] such as Gini’s index in classification problems. For classification problems, [45] highlights that, given their greedy nature, the classical methods based on recursive splitting do not lead to the global optimality of the decision tree. Since building optimal binary decision trees is known to be -hard [124], heuristic approaches based on mathematical programming paradigms, such as linear optimization [22], continuous optimization [23], and dynamic programming [9, 11, 75, 181], have been proposed.
To find provably optimal decision trees, [28] proposes a mixed integer programming formulation that has an exponential complexity in the depth of the tree. Given a fixed depth , the maximum number of nodes is indexed by . Following the notation of [28], the nodes are split into two sets, branch nodes and leaf nodes. The branch nodes apply a linear split where the left child node includes the data points that satisfy this split while the right one includes the remaining data. In [28], the splits that are applied at the branch nodes are restricted to a single variable with the option of not splitting a node. The binary decision variable takes a value of 1 if branch node is split and 0 otherwise. Since the splits are univariate, then variable , which denotes the value of the coefficient of feature in the split at node , is also binary. The cutpoint at node is .
At each of the leaf nodes , a class prediction is made based on the data points that are included. The binary variable indicates if data point is included to leaf node , i.e., or otherwise . The binary decision variable takes a value of 1 if leaf node is assigned label , and 0 otherwise while binary variable indicates if leaf node is used, i.e., or otherwise .
The mixed integer programming formulation is
[TABLE]
The objective function (22) minimizes the normalized total misclassification loss and the decision tree complexity which is given by , the total number of nodes that are split. is a tuning parameter and is the baseline loss obtained by predicting the most popular class from the entire dataset. Constraints (23)–(24) set the misclassification loss at leaf node as if node is assigned label (i.e ), where is the total number of data points at leaf node and is the total number of data points at node whose true labels are . The counting of and is enforced by (25) and (26), respectively, where is a parameter taking the value of 1 if data point has a label and otherwise. Constraints (27) indicate that each leaf node that is used (i.e., ) should be assigned to a label . Constraints (28) indicate that each data point should be assigned to exactly one leaf node. Constraints (29)–(30) indicate that data points can be assigned to a node only if that node is used and if a node is used then at least data points should be assigned to it. The splitting of the data points at each of the branch nodes is enforced by constraints (31)–(32) where is the set of ancestors of whose left branch has been followed on the path from the root node to node . Similarly, is the set of ancestors of whose right branch has been followed on the path from the root node to node . and are small numbers to enforce the strict split at the left branch (see [28] for finding good values for and ). Constraints (33)–(34) indicate that the splits are restricted to a single variable with the option of not splitting a node (). As enforced by constraints (35), if , the parent of node , does not apply a split then so is node . Finally constraints (36)–(38) set the binary conditions.
An alternative formulation to the optimal decision tree problem is provided in [114]. The main difference between the formulation of [114] and [28] is that the approach of [114] is specialized to the case where the features take categorical values. By exploiting the combinatorial structure that is present in the case of categorical variables, [114] provides a strong formulation of the optimal decision tree problem thus improving the computational performance. Furthermore the formulation of [114] is restricted to binary classification and the tree topology is fixed, which lowers the required computational effort for solving the optimization problem to optimality. A commonality between the models presented in [28] and [114] is that the split that is considered at each node of the decision tree involves only one variable mainly to achieve better computational performance when solving the optimization model. More generally, splits that span multiple variables can also be considered at each node as presented in [38, 211, 212]. The approach of [38], which is extended in [39] to account for sparsity by using regularization, is based on a nonlinear continuous optimization formulation to learn decision trees with general splits.
While single decision tree models are often preferred by data analysts for their high interpretability, the model accuracy can be largely improved by taking multiple decision trees into account. Such approaches include bagging, random forests, and boosting. Bagging creates multiple decision trees by obtaining several training subsets by randomly choosing with replacement data points from the training set and subsequently training a decision tree for each subset. Random forests create training subsets similar to bagging with the addition of randomly selecting a subset of features for training each tree. Boosting iteratively creates decision trees where a weight on the training data is set and is increased at each iteration for the misclassified data points so as to subsequently create a decision tree that is more likely to correctly classify previously misclassified data. These models that make predictions based on aggregating the predictions of individual trees are also known as tree ensemble. A mixed integer optimization model for tree ensemble has been recently proposed in [168].
Decision trees can also be used in a more general range of applications as algorithms for problem solving, data mining, and knowledge representation. In [10], several greedy and dynamic programming approaches are compared for building decision trees on datasets with inconsistent labels (i.e, many-valued decision approach). Many-valued decisions can be evaluated in terms of multiple cost functions in a multi-stage optimization [12]. Recently, [67] investigated conflicting objectives in the construction of decision trees by means of bi-criteria optimization. Since the single objectives, such as minimizing average depth or the number of terminal nodes, are known to be -hard, the authors propose a bi-criteria optimization approach by means of dynamic programming.
3.4 Support Vector Machines
Support vector machines (SVMs) are another class of supervised machine learning algorithms that are based on statistical learning and have received significant attention in the optimization literature [59, 209, 210]. Given a training set with training inputs where and binary response variables , the objective of the support vector machine problem is to identify a hyperplane , where and , which separates the two classes of data points with a maximal separation margin measured as the width of the band that separates the two classes. In this section, denotes the vector of coefficients corresponding to each of the input variables and is the intercept of the separating hyperplane. As detailed next, the underlying optimization problem is a linearly constrained convex quadratic optimization problem.
3.4.1 Hard Margin SVM
The most basic version of SVMs is the hard margin SVM that assumes that there exists a hyperplane that geometrically separates the data points into the two classes such that no data point is misclassified [73]. The training of the SVM model involves finding the hyperplane that separates the data and whose distance to the closest data point in either of the classes, i.e., margin, is maximized.
The distance of a particular data point to the separating hyperplane is
[TABLE]
The distance to the closest data point is normalized to where denotes the -norm. Thus the data points with labels are on one side of the hyperplane such that while the data point with labels are on the other side . The optimization problem for finding the separating hyperplane is then
[TABLE]
which is equivalent to
[TABLE]
that is a convex quadratic problem.
Forcing the data to be separable by a linear hyperplane is a strong condition that often does not hold in practice. Thus, the soft-margin SVM, which relaxes the condition of perfect separability, is used instead.
3.4.2 Soft-Margin SVM
When the data is not linearly separable, problem (39)–(41) is infeasible. Alternatively, [24] proposed a linear program that minimizes a weighted average of the errors given by the points lying on the wrong side of the separator. This work was then extended in [73] which presented the soft margin SVM. The soft margin SVM introduces slack variables into constraints (40) which are then penalized in the objective function as a proxy to minimizing the number of data points that are on the wrong side. The soft-margin SVM optimization problem is
[TABLE]
Another common alternative is to include the error term in the objective function by using the squared hinge loss instead of the hinge loss . The hinge loss function takes a value of zero for a data point that is correctly classified while it takes a positive value that is proportional to the distance from the separating hyperplane for a misclassified data point. Hyperparameter is then tuned to obtain the best classifier.
Besides the direct solution of problem (42)–(45) as a convex quadratic problem, replacing the -norm by the -norm leads to a linear optimization problem generally at the expense of higher misclassification rate [44].
3.4.3 Sparse SVM
Using the -norm is also an approach to sparsify , i.e., reduce the number of features that are involved in the classification model [44, 224]. An approach known as the elastic net includes both the -norm and the -norm in the objective function and tunes the bias towards one of the norms through a hyperparameter [217, 228]. Several other approaches for dealing with sparsity in SVM have been proposed in [8, 88, 103, 105, 115, 164, 183]. The number of features can be explicitly modeled in (42)–(45) by using binary variables where indicates that feature is selected and otherwise [60]. A constraint limiting the number of features to a maximum desired number can be enforced resulting in the following mixed integer quadratic problem
[TABLE]
Constraints (48) force when feature is used, i.e., ( denotes a sufficiently large number). Constraints (49) set a limit on the maximum number of features that can be used.
3.4.4 The Dual Problem and Kernel Tricks
The data points can be mapped to a higher dimensional space through a mapping function . A soft margin SVM can then be applied such that
[TABLE]
Through this mapping, the data has a linear classifier in the higher dimensional space however a nonlinear separation function is obtained in the original space.
To solve problem (53)–(56), the following dual problem is first obtained
[TABLE]
where are the dual variables of constraints (54). Given a kernel function where , the dual problem is
[TABLE]
which is a convex quadratic optimization problem. Thus only the kernel function is required while the explicit mapping is not needed.
Among the commonly used kernel functions is the polynomial where controls the trade-off between the higher-order and the lower-order terms in the polynomial and is the degree of the polynomial. The polynomial kernel models the interaction between the data up to the degree . A high degree polynomial tends to overfit the training data. Another kernel function is the radial basis where acts as a smoothing parameter. A smaller tends to overfit the training data. The sigmoidal kernel is also commonly used where is a scaling parameter of the input data and is a shifting parameter that controls the threshold of the mapping. Further details on kernel functions are provided in [2, 59, 122].
Since the classification in high dimensional space can be difficult to interpret for practitioners, Binarized SVM (BSVM) replaces the continuous predictor variables with a linear combination of binary cutoff variables [56]. BSVM is also extended in [57] to capture the interactions between the relevant variables. Another important practical aspect to consider is data uncertainty. Often the training data suffers from inaccuracies in the labels and in the features that are collected which may negatively affect the performance of the classifiers. While typically regularization is used to mitigate the effect of uncertainty, [29] proposes robust optimization models for logistic regression, decision trees, and support vector machines and shows increased accuracy over regularization, and most importantly without changing the complexity of the classification problem.
3.4.5 Support Vector Regression
Although as discussed earlier, SVM has been introduced for binary classification, its extension to regression, i.e., support vector regression, has received significant interest in the literature [197]. The core idea of support vector regression is to find a linear function that can approximate with a tolerance a training set where [210]. Such a linear function may however not exist, and thus slack variables and denoting positive and negative deviations from the desired tolerance are introduced and minimized similar to the soft-margin SVM. The corresponding optimization problem is
[TABLE]
Hyperparameter is tuned to adjust the weight on the deviation from the tolerance . This deviation from is the -insensitive loss function given by
[TABLE]
As detailed extensively in [197], kernel tricks can also be applied to (57)–(61) which is solved by formulating the dual problem.
3.4.6 Support Vector Ordinal Regression
In situations where the data contains ordering preferences, i.e., the training data is labeled by ranks, where the order of the rankings is relevant while the distances between the ranks is not defined or irrelevant to the training, the purpose of learning is to find a model that maps the preference information.
The application of classic regression models for such type of data requires the transformation of the ordinal ranks to numerical values. However, such approaches often fail in providing robust models as an appropriate function to map the ranks to distances is challenging to find [145]. An alternative is to encode the ordinal ranks into binary classifications at the expense of a large increase in the scale of the problems [118, 121].
An extension of SVM for ordinary data has been proposed in [195] and extended in [69]. Given a training dataset with ordered categories where is the number of data points labeled as order , the support vector ordinal regression finds separating parallel hyperplanes where is the threshold associated with the hyperplane that separates the orders from the remaining orders. Thus , the data sample of order , should have a function value lower than the margin while the data samples with orders should have a function value greater than the margin . The errors for violating these conditions are given by and respectively. Following [69], the associated SVM formulation is
[TABLE]
As detailed in [69], kernel tricks can be also applied by considering the dual problem. Finally we note that preference modeling using machine learning has several commonalities with various approaches in multi-criteria decision analysis and most notably, robust ordinal regression. We refer the readers to [72] for a detailed comparison between preference learning using machine learning and muti-criteria decision making.
4 Clustering
Data clustering is a class of unsupervised learning approaches that has been widely used, particularly in applications of data mining, pattern recognition, and information retrieval. Given an input , which includes unlabeled observations with , cluster analysis aims at finding subsets of , called clusters, which are homogeneous and well separated. Homogeneity indicates the similarity of the observations within the same cluster (typically, by means of a distance metric), while the separability accounts for the differences between entities of different clusters. The two concepts can be measured via several criteria and lead to different types of clustering algorithms (see, e.g., [117]). The number of clusters is typically a tuning parameter to be fixed before determining the clusters. An extensive survey on data clustering analysis is provided in [128].
In case the entities are points in a Euclidean space, the clustering problem is often modeled as a network problem and shares many similarities with classical problems in operations research, such as the -median problem [20, 143, 163, 173].
In the following subsections, the commonly used minimum sum-of-squares clustering, the capacitated clustering, and the -hyperplane clustering are discussed.
4.1 Minimum Sum-Of-Squares Clustering (a.k.a. -Means Clustering)
Minimum sum-of-squares clustering is one of the most commonly adopted clustering algorithms. It requires to find a number of disjoint clusters for observations , where such that the distance to cluster centroids is minimized. Given that typically the number of clusters is a-priori fixed, the problem is also referred to as -means clustering. The decision of the cluster size is typically taken by examining the elbow curve, or similarity indicators, such as silhouette values and Calinski-Harabasz index, or via mathematical programming approaches including the maximization of the modularity of the associated graph [50, 51].
Defining the binary variables
[TABLE]
and the centroid of each cluster , the problem of minimizing the within-cluster variance is formulated in [3] as the following mixed integer nonlinear program
[TABLE]
By introducing the variables which denote the distance of observation from centroid , the following linearized formulation is obtained
[TABLE]
Parameter is a sufficiently large number. A heuristic solution approach based on the gradient method is proposed for problem (62)–(65) in [13]. Alternatively, a column generation approach for large-scale instances has been proposed in [3] and a bundle approach has been presented in [132].
The case where the space is not Euclidean is considered in [58]. Alternatively, [189] presents the Heterogeneous Clustering Problem (HCP) where the observations to cluster are associated with multiple dissimilarity matrices. HCP is formulated as a mixed integer quadratically constrained quadratic program. Another variant is presented in [188] where the homogeneity is expressed by the minimization of the maximum diameter of the clusters. The resulting nonconvex bilinear mixed integer program is solved via a graph-theoretic approach based on seed finding.
Many common solution approaches for -means clustering are based on heuristics. A popular method implemented in data science packages (e.g., scikit-learn [182]) is the two-step improvement procedure proposed in [161]. Starting from a sample of points in set as initial cluster centers (centroids ), at each iteration , the algorithm assigns each point in to the nearest centroid and then computes the centroids of the new partition. The procedure is guaranteed to decrease the within-cluster variance and it is run until this metric is sufficiently low. Given the dependency of the procedure to the choice of , typically the clustering is repeated with different initial centroids and the best clusters are selected. Other heuristics relax the assumption to produce exactly clusters. For instance, [161] merges clusters if their centroids are sufficiently close. Clustering is also used within heuristics for hard combinatorial problems ([101, 149]), and can be integrated in problems where the evaluation of multiple solutions is important (e.g., Cluster Newton Method [6, 104]). Cluster Newton method approximates the Jacobian in the domain covered by the cluster of points, instead of locally as done by the traditional Newton’s Method [136], and this has a regularization effect.
4.2 Capacitated Clustering
The Capacitated Centered Clustering Problem (CCCP) deals with finding a set of clusters with a capacity limitation and homogeneity expressed by the similarity to the cluster centre. Given a set of potential clusters , a mathematical formulation for CCCP is given in [175] as
[TABLE]
Parameter is an upper bound on the number of clusters, is the dissimilarity measure between observation and cluster , is the weight of observation , and is the capacity of cluster . Variable denotes the assignment of observation to cluster and variable is equal to if cluster is used. If the metric is a distance and the clusters are homogeneous (i.e., ), the formulation also models the well-known facility location problem. A solution approach is discussed in [62] while an alternative quadratic programming formulation is presented in [154]. Solution heuristics have also been proposed in [163] and [191].
4.3 -Hyperplane Clustering
In the -Hyperplane Clustering (-HC) problem, a hyperplane, instead of a center, is associated with each cluster. This is motivated by applications such as text mining and image segmentation, where collinearity and coplanarity relations among the observations are the main interest of the unsupervised learning task, rather than the similarity. Given the observations , the -HC problem requires to find clusters, and a hyperplane , with and , for each cluster . The aim is to minimize the sum of the squared -norm Euclidean orthogonal distances between each observation and the corresponding cluster.
Given that the orthogonal distance of to hyperplane is given by , -HC is formulated in [4] as the following mixed integer quadratically constrained quadratic problem
[TABLE]
Binary variable is equal to if point is assigned to cluster , and [math] otherwise. Linear constraints (73)–(74) set as the distance between point and the hyperplane of cluster . These constraints are enforced only if is equal to , otherwise they are redundant. The non-convexity is due to constraint (75). As a solution approach, a distance-based reassignment heuristic that outperforms spatial branch-and-bound solvers is proposed in [4].
5 Linear Dimension Reduction
In Section 2.2, shrinkage methods have been discussed as a way to improve model interpretability by fitting a model with all original predictors. In this section, we discuss dimension reduction methods that search for linear combinations of the predictors such that (also called projections) where denotes column of X, i.e., the vector of values of feature of the training set. While this section focuses on Principle Component Analysis and Partial Least Squares, we note that other linear and nonlinear dimension reduction methods exist. An extensive survey on benefits and shortcomings of dimension reduction methods is presented in [76].
5.1 Principal Components
Principal Components Analysis (PCA) [131] aims to find a low-dimensional representation of the dataset with highly informative derived features. Principal components are ordered in terms of their explained variances, which measure the amount of information retained from the original set of features .
In particular, assuming the regressors are standardized to a mean of [math] and a variance of , the direction of the first principal component is a unit vector that is the solution of the optimization problem
[TABLE]
Problem (79)–(80) is the traditional formulation of PCA and can be solved via Lagrange multipliers methods. Since the formulation is sensitive to the presence of outliers, several approaches have been proposed to improve robustness [185]. One approach is to replace the -norm in (79) with the -norm.
An iterative approach can be used to obtain the principal components where the first principal component is the projection of the original features with the largest variability. The subsequent principal components are obtained iteratively where each principal component is obtained by a linear combination of the feature columns . Each is uncorrelated with which have larger variance. Introducing the sample covariance matrix of the regressors , the direction of the -th principal component is the solution of
[TABLE]
PCA can be used for several data analysis problems which benefit from reducing the problem dimension. Principal Components Regression (PCR) is a two-stage procedure that uses the first principal components as predictors for a linear regression model. PCR has the advantage of including less predictors than the original set and at the same time retaining the variability of the dataset in the derived features. However, principal components might not be relevant with the response variables of the regression.
To select principal components in regression models, the regression loss function and the PCA objective function can be combined in a single-step quadratic programming formulation [135].
Since the identification of the principal components does not require any knowledge of the response , PCA can also be adopted in unsupervised learning such as in the -means clustering method (see Section 4.1, [85]). A known drawback of PCA is interpretability. To promote the sparsity of the projected components, and thus make them more interpretable, [55] formulates a Mixed Integer Nonlinear Programming (MINLP) problem and shows that the level of sparsity can be imposed in the model. Alternatively, the variance of the principal components and their sparsity can be jointly maximized in a biobjective framework [54].
5.2 Partial Least Squares
Partial Least Squares (PLS) identifies transformed features by projecting both the predictors and their corresponding response into a new space, and this is an approach specific to regression problems [99]. PLS is particularly viable for problems with a large number of features compared to observations as it aims to identify the latent factors that explain most the variations in the response. PLS corresponds to fitting simple regression models each containing a single predictor variable.
The first PLS direction is denoted by where each component is found by fitting a regression with predictor and response . The first PLS direction points towards the features that are more strongly related to the response. For computing the second PLS direction, the features vectors are first orthogonalized with respect to (as per the Gram-Schmidt approach), and then individually fitted in simple regression models with response . The process is iterated for all PLS directions . The coefficient of the simple regression of onto each original feature can also be computed as the inner product . Similar to PCR, PLS then fits a linear regression model with regressors and response .
While the principal components directions maximize variance, PLS searches for directions with both high variance and high correlation with the response. The -th direction can be found by solving the optimization problem
[TABLE]
where Corr() indicates the correlation matrix, Var() the variance, the sample covariance matrix of , and (86) ensures that is uncorrelated with the previous directions .
6 Deep Learning
Deep Learning received a first momentum until the 80s due to universal approximation results [79, 123]. Neural networks with a single layer with a finite number of units can represent any multivariate continuous function on a compact subset in with arbitrary precision. However, the computational complexity required for training Deep Neural Networks (DNNs) hindered their diffusion by late 90s. Starting 2010, the empirical success of DNNs has been widely recognized for several reasons, including the development of advanced processing units, namely GPUs, the advances in the efficiency of training algorithms such as backpropagation, the establishment of proper initialization parameters, and the massive collection of data enabled by new technologies in a variety of domains (e.g., healthcare, supply chain management [205], marketing, logistics [215], Internet of Things).
DNNs can be used for the regression and classification tasks discussed in the previous sections, especially when traditional machine learning models fail to capture complex relationships between the input data and the quantitative response, or class, to be learned. The aim of this section is to describe the decision optimization problems associated with DNN architectures. To facilitate the presentation, the notation for the common parameters is provided in Table 1, and an example of fully connected feedforward network is shown in Figure 1.
The output vector of a DNN is computed by propagating the information from the input layer to each following layer via the weight matrices , the bias vectors , and the activation function , such that
[TABLE]
Activation functions indicate whether a neuron should be activated or not in the network, and are responsible for the capability of DNNs to learn complex relationships between the input and the output. The rectified linear unit
[TABLE]
is typically one of the preferred options for activation functions, mainly because it can be optimized with gradient-based methods for DNN training, and tends to produce sparse networks (where not all neurons are activated).
In the context of regression, the components of can directly represent the response values learned. For a classification problem, the vector corresponds to the logits of the classifier. In order to interpret as a vector of class probabilities, functions such as the logistic sigmoidal or the softmax can be applied to [108]. The classifier modeled by the DNN then classifies an input with the label correspondent to the maximum activation
The task of training a DNN consists of determining the weights and the biases that make the model best fit the training data, according to a certain measure of training loss. In multivariate regression with response variables [126, 166], the training loss is typically the sum-of-squared errors where denotes response corresponding to the -th input vector. For classification with classes, cross-entropy is preferred. An effective approach to minimize is by gradient descent, called back-propagation in this setting. Typically, one is not interested in a proven local minimum of , as this is likely to overfit the training dataset and yield a learning model with a high variance. Similar to the Ridge regression (see Section 2), the loss function can include regularization terms, such as a weight decay term
[TABLE]
or alternatively a weight elimination penalty term
[TABLE]
Weight decay limits the growth of the weights, which speeds up the training via backpropagation, and has been shown to limit overfitting (see [184] for a discussion about overfitting in Neural Networks).
The aim of this section is to present the optimization models that are used in DNN for feedforward architectures. Several other neural network architectures have been investigated in deep learning [108]. In particular, Convolutional Neural Networks (CNN) [152] have been successfully adopted for processing data with a grid-like topology, such as images [147], videos [133], and traffic analytics [219]. In CNN, the output of layers is obtained via convolutions (instead of the matrix multiplication in feedforward networks), and pooling operations on nearby units (such as average or maximum operators). In the remainder of the section, mixed integer programming models for DNN training are introduced in Section 6.1, and ensemble approaches with multiple activation functions are discussed in Section 6.2.
6.1 Mixed Integer Programming for DNN Architectures
Motivated by the considerable improvements of mixed integer programming solvers, a natural question is how to model a trained DNN as a MIP. In [97], DNNs with ReLU activation
[TABLE]
are modeled as a MIP with decision variables expressing the output vector of layer , and is the input vector. To express (88) explicitly, each unit of the DNN is associated with binary activation variables , and continuous slack variables . The following mixed integer linear problem is proposed
[TABLE]
where are suitably large constants. We note that since the DNN is trained, the weights and bias are fixed parameters. Depending on the application, different activation weights and activation costs can also be used for each . If known, upper bound can be enforced on the output of unit via constraints (93), and slack can be bounded by via constraints (94).
The proposed MIP is feasible for every input vector since it computes the activation in the subsequent layers. Constraints (91) and (92) are known to have a weak continuous relaxation, and the tightness of the chosen constants (bounds) is crucial for their effectiveness. Several optimization solvers can directly handle such kind of constraints as indicator constraints [40]. In [97], a bound-tightening strategy to reduce the computational times is proposed and the largest DNN tested with this approach is a -layer DNN with internal units.
Problem (89)–(94) can model several tasks in Deep Learning, other than the computation of quantitative responses in regression, and of classification. Such tasks include
- •
Pooling operations: The average and the maximum operators
[TABLE]
can be incorporated in the hidden layers. In the case of max pooling operations, additional indicator constraints are required. For example, average and maximum operators are often used in CNNs, as mentioned earlier in Section 6.
- •
Maximizing the unit activation: By maximizing the objective function (89), one can find input examples that maximize the activation of the units. This may be of interest in applications such as the visualization of image features.
- •
Building crafted adversarial examples: Given an input vector labeled as by the DNN, the search for perturbations of that are classified as (adversarial examples), can be conducted by adding conditions on the activation of the final layer and minimizing the perturbation. In [97], such conditions are actually restricting the search for adversarial examples and the resulting formulation does not guarantee an adversarial solution nor can prove that no adversarial examples exist. Adversarial learning is the objective of the discussion in Section 7.
- •
Training: In this case, the weights and biases are decision variables. The resulting bilinear terms in (90) and the considerable number of decision variables in the formulation limit the applicability of (89)–(94) for DNN training.
Another attempt in modelling DNNs via MIPs is provided by [140], in the context of Binarized Neural Networks (BNNs). BNNs are characterized by having binary weights and by using the sign function for neuron activation [74]. In [140], a MIP is proposed for finding adversarial examples in BNNs by maximizing the difference between the activation of the targeted label and the predicted label of the input , in the final layer (namely, ). Contrary to [97], the MIP of [140] does not impose limitations on the search of adversarial examples, apart from the perturbation quantity. In terms of optimality criterion however, searching for the proven largest misclassified example is different from finding a targeted adversarial example. Furthermore, while there is interest in minimally perturbed adversarial examples, suboptimal solutions corresponding to adversarial examples (i.e., ) may have a perturbation smaller than that of the optimal solution. Recently, [125] investigated a hybrid constraint programming/mixed integer programming method to train BNNs. Such model-based approach provides solutions that generalize better than those found by the largely adopted training solvers, such as gradient descent, especially for small datasets. We note that methods such as gradient descent can usually only guarantee local optimality (unless early stopping takes place).
Besides [97], other MIP frameworks have been proposed to model certain properties of neural networks in a bounded input domain. In [65], the problem of computing maximum perturbation bounds for DNNs is formulated as a MIP, where indicator constraints and disjunctive constraints are modeled using constraints with big-M coefficients [111]. The maximum perturbation bound is a threshold such that the perturbed input may be classified correctly with a high probability. A restrictive misclassification condition is added when formulating the MIP. Hence, the infeasibility of the MIP does not certify the absence of adversarial examples. In addition to the activation, the function is also considered by introducing quadratic constraints and several heuristics are proposed to solve the resulting problem. In [206], a model to formally measure the vulnerability to adversarial examples is proposed (the concept of vulnerability of neural networks is discussed in more details in Sections 7.1 and 7.2). A tight formulation for the resulting nonlinearities and a novel presolve technique are introduced to limit the number of binary variables and improve the numerical conditioning. However, the misclassification condition of adversarial examples is not explicitly defined but is rather left in the form “different from” and not explicitly modeled using equality/inequality constraints. In [193], the aim is to count or bound the number of linear regions that a piecewise linear classifier represented by a DNN can attain. Assuming that the input space is bounded and polyhedral, the DNN is modeled as a MIP. The contributions of adopting a MIP framework in this context are limited, especially in comparison with the computational results achieved in [170].
MIP frameworks can also be used to formulate the verification problem for neural networks as a satisfiability problem. In [134], a satisfiability modulo theory solver is proposed based on an extension of the simplex method to accommodate the activation functions. In [48], a branch-and-bound framework for verifying piecewise-linear neural networks is introduced. For a recent survey on the approaches for automated verification of NNs, the reader is referred to [153].
6.2 Activation Ensembles
Another research direction in neural network architectures investigates the possibility of adopting multiple activation functions inside the layers of a neural network, to increase the accuracy of the classifier. Some examples in this framework are given by the maxout units [110], returning the maximum of multiple linear affine functions, and the network-in-network paradigm [156] where the activation function is replaced by fully connected network. In [1], adaptive piecewise linear activation functions are learned when training each neuron. Specifically, for each unit and value , activation is considered as
[TABLE]
where the number of hinges is a hyperparameter to be fixed before training, while the variables have to be learned. Functions generalize the function (first term of (95)), and can approximate a class of continuous piecewise-linear functions, for large enough [1].
In a more general perspective, ensemble layers are proposed in [142] to consider multiple activation functions in a neural network. The idea is to embed a family of activation functions and let the network itself choose the magnitude of their activation for each neuron during the training. To promote relatively equal contribution to learning, the activation functions need to be scaled to the interval . In order to measure the impact of the activation in the neural network, each function is associated with a continuous variable . The resulting activation for neuron is then given by
[TABLE]
where is the output of neuron associated with training example , is the set of training observations, and is a small tolerance. Equation (96) is a weighted sum of the scaled functions, which is integrated in the training of the DNN architecture. The and in (96) can be approximated on a minibatch of observations in , in the testing phase. In order to impose the selection of functions , the magnitude of the weights is limited in a projection subproblem, where for each neuron the network should choose an activation function and therefore all should sum to . If are the weight values obtained by gradient descent while training, then the projected weights are found by solving the convex quadratic programming problem
[TABLE]
which can be solved in closed form via the Karush-Kuhn-Tucker (KKT) conditions.
7 Adversarial Learning
Despite the wide adoption of Machine Learning models in real-world applications, their integration into safety and security related use cases still necessitates thorough evaluation and research. A large number of contributions in the literature pointed out the dangers caused by perturbed examples, also called adversarial examples, causing classification errors [35, 201]. Malicious attackers can thus exploit security falls in a general classifier. In case the attacker has a perfect knowledge of the classifier’s architecture (i.e., the result of the training phase), then a white-box attack can be performed. Black-box attacks are instead performed without full information of the classifier. The interest in adversarial examples is also motivated by the transferability of the attacks to different trained models [148, 208]. Adversarial learning then emerges as a framework to devise vulnerability attacks for classification models [160].
From a mathematical perspective, such security issues have been formerly expressed via min-max approaches where the learner’s and the attacker’s loss functions are antagonistic [82, 106, 150]. Non-antagonistic losses are formulated as a Stackelberg equilibrium problem involving a bilevel optimization formulation [47], or in a Nash equilibrium approach [46]. These theoretical frameworks rely on the assumption of expressing the actual problem constraints in a game-theory setting, which is often not a viable option for real-life applications.
The search for adversarial examples can also be used to evaluate the efficiency of Generative Adversarial Networks (GANs) [109]. A GAN is a minmax two-player game where a generative model tries to reproduce the training data distribution and a discriminative model estimates the probability of detecting samples coming from the true training distribution, rather than . The game terminates at a saddle point, which is a minimum with respect to a player’s strategy and a maximum for the other player’s strategy. Discriminative networks can be affected by the presence of adversarial examples because the specific inputs to the classification networks are not considered in GANs training.
Adversarial attacks on the test set can be conducted in a targeted or untargeted fashion [53]. In the targeted setup, the attacker aims to achieve a classification with a chosen target class (discussed in Section 7.1), while the untargeted misclassification is not constrained to achieve a specific class (Section 7.2). The robustness of DNNs to adversarial attacks is discussed in Section 7.3. Finally, data poisoning attacks are described in Section 7.4. While the majority of the cited papers of the present section refer to DNN applications, adversarial learning can, in general, be formulated for classifiers with quantitative classes, such as those discussed in Section 3.
7.1 Targeted attacks
Given a neural network classifier , an input with label , and a target label , a targeted attack consists of a perturbation such that .
This corresponds to finding an input “close” to , which is misclassified by . Clearly, if the target coincides with , the problem has the trivial solution and no misclassification takes place.
The minimum adversarial problem for targeted attacks consists of finding a perturbation by solving
[TABLE]
The condition (102) ensures that the perturbed example belongs to the set of admissible inputs. The difficulty of solving problem (100)–(102) to optimality depends on the complexity of the classifier , and the set of feasible inputs. In general, it is computationally challenging to find an optimal solution to the problem, especially in the case of neural networks.
For classification of normalized images with binary pixel values, [201] introduces the box-constrained approximation
[TABLE]
where denotes the loss function for training (e.g., cross-entropy). The approximation is exact for convex loss functions, and can be solved via a line search algorithm on . For a fixed , the formulation can be tackled by the box-constrained version of the Limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) method [49]. In [112], is fixed such that the perturbation is minimized on a sufficiently large subset of data points, and the mean prediction error rate of is greater than a threshold. In [53], the -norm in (100) is generalized to the -norm with and an alternative formulation is introduced which includes functions in the objective where is satisfied if and only if . The equivalent formulation is then
[TABLE]
where is a constant that can be determined by binary search such that the solution satisfies the condition . For the case where (106) are box constraints similar to (104), the authors propose strategies for applying optimization algorithms such as Adam [141]. Novel classes of attacks are identified for the considered metrics.
7.2 Untargeted attacks
In untargeted attacks, one searches for adversarial examples close to the original input with label for which the classified label of is different from , without targeting a specific label for . Given that the only aim is misclassification, untargeted attacks are deemed less powerful than the targeted counterpart, and received less attention in the literature.
A mathematical formulation for finding minimum adversarial distortion for untargeted attacks is proposed in [206]. Assuming that the output values of classifier are expressed by the functions associated with labels (i.e., are the scoring functions), and a distance metric is given, then a minimum perturbation for an untargeted attack is found by solving
[TABLE]
This formulation can easily accommodate targeted attacks in a set by replacing (108) with . The most commonly adopted metrics in the literature are the , , and -norm, which can all be expressed with continuous variables, as shown in [206]. The -norm makes the objective function of the outer-level optimization problem quadratic.
In order to express the logical constraint (108) in a mathematical programming formulation, problem (107)–(109) can be cast as the bilevel optimization problem
[TABLE]
where is a small constant, is a decision variable representing the classified label, is a big-M coefficient, and is a binary variable that enforces one of the constraints (111)–(112) which express the condition of misclassification . The complexity of the inner-level optimization problem is dependent on the scoring functions. Given that the upper-level feasibility set is typically continuous and the lower-level variable ranges on a discrete set, the problem is in fact a continuous discrete bilevel programming problem [93] with convex quadratic function [90], which requires dedicated reformulations or approximations [64, 113, 130].
We introduce an alternative mathematical formulation for finding untargeted adversarial examples satisfying condition (108). A perturbed input for a sample classified with label is an untargeted adversarial example if the classified label of is different from . This condition is equivalent to
[TABLE]
Condition (116) is an existence condition, which can be formalized by introducing the functions , , and the condition
[TABLE]
where parameter enforces that at least one function has to be activated for a perturbation . Therefore, untargeted adversarial examples can be found from formulation (107)–(109) by replacing condition (108) with the linear condition (117) and adding functions . The complexity of this approach depends on the scoring functions . The extra functions can be expressed as a mixed integer formulation as done in problem (89)–(94).
7.3 Adversarial robustness
Another interesting line of research motivated by adversarial learning deals with adversarial training, which consists of techniques to make a neural network robust to adversarial attacks. The problem of measuring the robustness of a neural network is formalized in [17]. The pointwise robustness evaluates if the classifier on is robust for “small” perturbations. Formally, is said to be -robust if
[TABLE]
Then, the pointwise robustness is the minimum for which fails to be -robust:
[TABLE]
As detailed in [17], is computed by expressing (119) as a constraint satisfiability problem. By imposing a bound on the perturbation, an estimation of the pointwise robustness can be performed by solving a MIP [65].
A widely known defense technique is to augment the training data with adversarial examples; this however does not offer robustness guarantees on novel kinds of attacks. The adversarial training of neural network via robust optimization is investigated in [162]. In this setting, the goal is to train a neural network to be resistant to all attacks belonging to a certain class of perturbations. Particularly, the adversarial robustness with a saddle point (min-max) formulation is studied in [162] which is obtained by augmenting the Empirical Risk Minimization paradigm.
Let be the set of model parameters to be learned, and be the loss function considered in the training phase (e.g., the cross-entropy loss) for training examples and labels , and let be the set of allowed perturbations (e.g., an ball). The aim is to minimize the worst expected adversarial loss on the set of inputs perturbed by
[TABLE]
where the expectation value is computed on the distribution of the training samples. The saddle point problem (120) is viewed as the composition of an inner maximization and an outer minimization problem. The inner problem corresponds to attacking a trained neural network by means of the perturbations . The outer problem deals with the training of the classifier in a robust manner. The importance of formulation (120) stems both from the formalization of adversarial training and from the quantification of the robustness given by the objective function value on the chosen class of perturbations. To find solutions to (120) in a reasonable time, the structure of the local minima of the loss function can be explored.
Another robust training approach consists of optimizing the model parameters with respect to worst-case data [194]. This is formalized by introducing a perturbation set for each training example . The aim is then to optimize
[TABLE]
An alternating ascent and descent steps procedure can be used to solve (121) with the loss function approximated by the first-order Taylor expansion around the training points.
7.4 Data Poisoning
A popular class of attacks for decreasing the training accuracy of classifiers is that of data poisoning, which was first studied for SVMs [36]. A data poisoning attack consists of hiding corrupted, altered or noisy data in the training dataset. In [200], worst-case bounds on the efficacy of a class of causative data poisoning attacks are studied. The causative attacks [15] proceed as follow:
- •
a clean training dataset with data points drawn by a data-generating distribution is generated
- •
the attacker adds malicious examples to , to let the defender (learner) learn a bad model
- •
the defender learns model with parameters from the full dataset , reporting a test loss .
Data poisoning can be viewed as a game between the attacker and the defender players, where the defender wants to minimize , and the attacker seeks to maximize it. As discussed in [200], data sanitization defenses to limit the increase of test loss include two steps: (i) data cleaning (e.g., removing outliers which are likely to be poisoned examples), to produce a feasible dataset , and (ii) minimizing a margin-based loss on the cleaned dataset . The learned model is then .
Poisoning attacks can also be performed in semi-online or online fashion, where training data is processed in a streaming manner, and not in fixed batches (i.e., offline). In the semi-online context, the attacker can modify part of the training data stream so as to maximize the classification loss, and the evaluation of the objective (loss) is done only at the end of the training. In the fully-online scenario, the classifier is instead updated and evaluated during the training process. In [218], a white-box attacker’s behavior in online learning for a linear classifier (e.g., SVM with binary labels ) is formulated. The attacker knows the order in which the training data is processed by the learner. The data stream arrives in instants (, with ) and the classification weights are updated using an online gradient descent algorithm [227] such that where is a regularization function, is the step length of the iterate update, and is a convex loss function. Let be the cleaned dataset at time (which can be obtained, for instance, via the sphere and slab defenses), be a given upper bound on the number of changed examples in due to data sanitization, be the attacker’s objective (e.g., classification error on the test set), be the cardinality of a set.
The semi-online attacker optimization problem can then be formulated as
[TABLE]
Compared to the offline case, the weights to be learned are a complex function of the data stream , which makes the gradient computation more challenging and the KKT conditions do not hold. The optimization problem can be simplified by considering a convex surrogate for the objective function, given by the logistic loss. In addition, the expectation is conducted over a separate validation dataset and a label inversion procedure is implemented to cope with the multiple local maxima of the classifier function. The fully-online case can also be addressed by replacing objective (122) with
8 Emerging Paradigms
8.1 Machine Teaching
In all Machine Learning tasks discussed so far, the size of the training set of the machine learning models has been considered as a hyperparameter. The Teaching Dimension problem identifies the minimum size of a training set to correctly teach a model [107, 196]. The teaching dimension of linear learners, such as Ridge regression, SVM, and logistic regression has been recently discussed in [157]. With the intent to generalize the teaching dimension problem to a variety of teaching tasks, [225] and [226] provide the Machine Teaching framework. Machine Teaching is essentially an inverse problem to Machine Learning. While in a learning task, the training dataset is given and the model parameters have to be determined, the role of a teacher is to let a learner approximately learn a given model by providing a proper set of training examples (also called teaching dataset in this context). A Machine Teaching task requires to select:
i) a Teaching Risk TR expressing the error of the learner, with respect to model ; ii) a Teaching Cost TC expressing the convenience of the teaching dataset, from the prospective of the teacher, weighted by a regularization factor ; iii) a learner L.
Formally, machine teaching can be cast as a bilevel optimization problem
[TABLE]
The upper optimization is the teacher’s problem and the lower optimization is the learner’s machine learning problem. The teacher is aware of the learner, which could be a classifier (such as those of Section 3) or a deep neural network. Machine teaching encompasses a wide variety of applications, such as data poisoning attacks, computer tutoring systems, and adversarial training.
Problem (125)–(126) is, in general, challenging to solve. However, for certain convex learners, one can replace the lower problem by the corresponding KKT conditions, and reduce the problem to a single level formulation. The teacher is typically optimizing over a discrete space of teaching sets, hence, for some problem instances, the submodularity properties of the problem may be explored. For problems with a small teaching set, it is possible to formulate the teaching problem as a mixed integer nonlinear program. The computation of the optimal training set remains, in general, an open problem, and is especially challenging in the case where the learning algorithm does not have a closed-form solution with respect to the training set [225].
The minimization of teaching cost can be directly enforced in the constrained formulation
[TABLE]
which allows for either approximate or exact teaching. Alternatively, given a teaching budget , the learning is performed via the constrained formulation
[TABLE]
Other variants consider multiple learners to be taught by the same teacher (i.e., common teaching set). The teacher can aim to optimize for the worst learner (minimax risk), or the average learner (Bayes risk). For the teaching dimension problem, the teaching cost is the cardinality of the teaching dataset, namely its [math]-norm. If the empirical minimization loss is guiding the learning process, and is the regularization weight, then teaching dimension problem can be formulated as
[TABLE]
Machine teaching approaches tailored to specific learners have also been explored in the literature. In [223], a method is proposed for the Bayesian learners, while [180] focuses on Generalized Context Model learners. In [165], the bilevel optimization of machine teaching is explored to devise optimal data poisoning attacks for a broad family of learners (i.e., SVM, logistic regression, linear regression). The attacker seeks the minimum training set poisoning to attack the learned model. By using the KKT conditions of the learner’s problem, the bilevel formulation is turned into a single level optimization problem.
8.2 Empirical Model Learning
Empirical model learning (EML) aims to integrate machine learning models in combinatorial optimization in order to support decision-making in high-complexity systems through prescriptive analytics. This goes beyond the traditional what-if approaches where a predictive model (e.g., a simulation model) is used to estimate the parameters of an optimization model. A general framework for an EML approach is provided in [159] and requires the following:
- •
A vector of decision variables , with feasible over the domain .
- •
A mathematical encoding of the Machine Learning model.
- •
A vector of observables obtained from .
- •
Logical predicates such as mathematical programming inequalities or combinatorial restrictions in constraint programming.
- •
A cost function .
EML then solves the following optimization problem
[TABLE]
The combinatorial structure of the problem is defined by (136), (137), and (139) while (138) embeds the empirical machine learning model in the combinatorial problem. Embedding techniques for neural networks and decision trees are presented in [159] using optimization approaches that include mixed integer nonlinear programming, constraint programming, and SAT Modulo Theories, and local search.
8.3 Bayesian Network Structure Learning
Bayesian networks are a class of models that represent cause-effect relationships. These networks are learned by deriving the causal relationships from data. A Bayesian network is visually represented as a direct acyclic graph where each of the nodes in corresponds to one variable and the edges are directional relations that indicate the cause and effect relationships among the variables. A conditional probability distribution is associated with every node/variables and along with the network structure expresses the conditional dependencies among all the variables. A main challenge in learning Bayesian networks is learning the network structure from the data. This is known as the Bayesian network structure learning problem. Finding the optimal Bayesian network structure is -hard [66]. Mixed integer programming formulations of the Bayesian network structure learning have been proposed [14] and solved by using relaxations [127], cutting planes [16, 52, 78], and heuristics [102, 222].
The case of learning Bayesian network structures when the width of the tree is bounded by a small constant is computationally tractable [177, 179]. The bounded tree-width case is thus a restriction on the Bayesian network structure that limits the ability to represent exactly the underlying distribution of the data with the aim to achieve reasonable computational performance when computing the network structure. Following [177], to formulate the Bayesian network structure learning problem with a maximum tree-width , the following binary variables are defined
[TABLE]
where and is a parent set for node . For each node , the collection of parent sets is denoted as and is assumed to be available (i.e., enumerated beforehand). Thus with , and where . Additional auxiliary variables , where denotes the number of nodes in , and are introduced to enforce the tree-width and directed acyclic graph conditions. The problem is formulated as
[TABLE]
The objective function (140) maximizes the score of the acyclic graph where is a score function that can be efficiently computed for every node [52]. Constraints (141)–(143) enforce a maximum tree-width while constraints (144)–(145) enforce the directed acyclic graph condition. Constraints (146)–(147) enforce the relationship between the and variables and finally constraints (148) set the variable bounds and binary conditions. Another formulation for the bounded tree-width problem has been proposed in [179] and includes an exponential number of constraints which are separated in a branch-and-cut framework. Both formulations however become computationally demanding as the number of features in the data set grows and with an increase in the tree-width limit. Several search heuristics have also been proposed as solution approaches [177, 176, 190].
9 Conclusions
Mathematical programming constitutes a fundamental aspect of many machine learning models where the training of these models is a large scale optimization problem. This paper surveyed a wide range of machine learning models namely regression, classification, clustering, and deep learning as well as the new emerging paradigms of machine teaching and empirical model learning. The important mathematical optimization models for expressing these machine learning models are presented and discussed. Exploiting the large scale optimization formulations and devising model specific solution approaches is an important line of research particularly benefiting from the maturity of commercial optimization software to solve the problems to optimality or to devise effective heuristics. However, as highlighted in [155, 184], providing quantitative performance bounds remains an open problem. The nonlinearity of the models, the associated uncertainty of the data, as well as the scale of the problems represent some of the very important and compelling challenges to the mathematical optimization community. Furthermore, bilevel formulations play a big role in adversarial learning [116], including adversarial training, data poisoning and neural network robustness.
Based on this survey, we summarize the distinctive features and the potential open machine learning problems that may benefit from the advances in computational optimization.
- •
Regression. The typical approaches to avoid overfitting and to handle uncertainty in the data include shrinkage methods and dimension reduction. These approaches can all be posed as mathematical programming models. General non-convex regularization to enforce sparsity without incurring shrinkage and bias (such as in lasso and ridge regularization) remain computationally challenging to solve to optimality. Investigating tighter relaxations and exact solution approaches continue to be an active line of research [7].
- •
Classification. Classification problems can also be naturally formulated as optimization problems. Support vector machines in particular have been well studied in the optimization literature. Similar to regression, classifier sparsity is one important approach to avoid overfitting. Additionally, exploiting the kernel tricks is key as nonlinear separators are obtained without additional complexity. However, when posed as an optimization problem, it is still unclear how to exploit kernel tricks in sparse SVM optimization models. Another advantage to express machine learning problems as optimization problems and in particular classification problems is to account for inaccuracies in the data. Handling data uncertainty is a deeply explored field in the optimization literature and several practical approaches have been presented to handle uncertainty through robust and stochastic optimization. Such advances in the optimization literature are currently being investigated to improve over the standard approaches [29].
- •
Clustering. Clustering problems are in general formulated as MINLPs that are hard to solve to optimality. The challenges include handling the non-convexity as well as the large scale instances which is a challenge even for linear variants such as the capacitated centred clustering (formulated as a binary linear model). Especially for large-scale instances, heuristics are typically devised. Exact approaches for clustering received less attention in the literature.
- •
DNNs architectures as MIPs. The advantage of mathematical programming approaches to model DNNs has only been showcased for relatively small size data sets due to the scale of the underlying optimization model. Furthermore, expressing misclassification conditions for adversarial examples in a non-restrictive manner, and handling the uncertainty in the training data are open problems in this context.
- •
Adversarial learning and adversarial robustness. Optimization models for the search for adversarial examples are important to identify and subsequently protect against novel sets of attacks. The complexity of the mathematical models in this context is highly dependent on the the classifier function. Untargeted attacks received less attention in the literature, and the mathematical programming formulation (110)–(114) has been introduced in section 7.2. Furthermore, designing models robust to adversarial attacks is a two-player game, which can be cast as a bilevel optimization problem. The loss function adopted by the learner is one main complexity for the resulting mathematical model and solution approaches remain to be investigated.
- •
Data poisoning: Similar to adversarial robustness, defending against the poisoning of the training data is a two-player game. The case of online data retrieval is especially challenging for gradient-based algorithms as the KKT conditions do not hold.
- •
Activation ensembles. Activation ensembles seek a trade-off between the classifier accuracy and computational feasibility of training with a mathematical programming approach. Adopting activation ensembles to train large DNNs have not been investigated yet.
- •
Machine teaching. Posed as a bilevel optimization problem, one of the challenges in machine teaching is to devise computationally tractable single-level formulations that model the learner, the teaching risk, and the teaching cost. Machine teaching also generalizes a number of two-player games that are important in practice including data poisoning and adversarial training.
- •
Empirical model learning. This emerging paradigm can be seen as the bridge combining machine learning for parameter estimation and operations research for optimization. As such, theoretical and practical challenges remain to be investigated to propose prescriptive analytics models jointly combining learning and optimization in practical applications.
While this survey does not discuss numerical optimization techniques since they were recently reviewed in [43, 77, 221], we note the fundamental role of the stochastic gradient algorithm [186] and its variants on large scale machine learning. We also highlight the potential impact of machine learning on advancing the solution approaches of mathematical programming [95, 96].
This survey has also focused on the learning process (loss minimization), however we note that challenging optimization problems also appear in the inference process, i.e., energy minimization (see [151] for a comprehensive survey). In the inference step, the best output is chosen from among all possible outputs given a certain input such that an “energy function” is minimized. The energy function provides a measure of the goodness of a particular configuration of the input and output variables. Energy optimization constitute a common framework for machine learning where the training of a model aims at finding the optimal energy function.
A key part of most machine learning approaches is the choice of the hyperparameters of the learning model. The Hyperparameter Optimization (HPO) is usually driven by the data scientist’s experience and the characteristics of the dataset and typically follows heuristic rules or cross-validation approaches. Alternatively, the HPO problem can be modeled as a box-constrained mathematical optimization problem [83], or as a bilevel optimization problem as discussed in [98, 144, 171], which provides theoretical convergence guarantees in addition to computational advantage. Automated approaches for HPO are also an active area of research in Machine Learning [26, 92, 220].
Finally, since the recent widespread of machine learning to several research disciplines and in the mainstream industry can be largely attributed to the availability of data and the relatively easy to use libraries, we summarize in the online supplement the resources that may be of value for research.
Acknowledgement
We are very grateful to four anonymous referees for their valuable feedback and comments that helped improve the content and presentation of the paper. Joe Naoum-Sawaya was supported by NSERC Discovery Grant RGPIN-2017-03962 and Bissan Ghaddar was supported by NSERC Discovery Grant RGPIN-2017-04185.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] Forest Agostinelli, Matthew Hoffman, Peter Sadowski, and Pierre Baldi. Learning activation functions to improve deep neural networks. Technical report, ar Xiv preprint 1412.6830, 2014.
- 2[2] Md Ashad Alam, Hui-Yi Lin, Hong-Wen Deng, Vince D Calhoun, and Yu-Ping Wang. A kernel machine method for detecting higher order interactions in multimodal datasets: Application to schizophrenia. Journal of Neuroscience Methods , 309:161–174, 2018.
- 3[3] Daniel Aloise, Pierre Hansen, and Leo Liberti. An improved column generation algorithm for minimum sum-of-squares clustering. Mathematical Programming , 131(1):195–220, 2012.
- 4[4] Edoardo Amaldi and Stefano Coniglio. A distance-based point-reassignment heuristic for the k-hyperplane clustering problem. European Journal of Operational Research , 227(1):22–29, 2013.
- 5[5] Edoardo Amaldi, Stefano Coniglio, and Leonardo Taccari. Discrete optimization methods to fit piecewise affine models to data points. Computers & Operations Research , 75:214–230, 2016.
- 6[6] Yasunori Aoki, Ken Hayami, Hans De Sterck, and Akihiko Konagaya. Cluster Newton method for sampling multiple solutions of underdetermined inverse problems: application to a parameter identification problem in pharmacokinetics. SIAM Journal on Scientific Computing , 36(1):14–44, 2014.
- 7[7] Alper Atamturk and Andres Gomez. Rank-one convexification for sparse regression. Technical report, ar Xiv preprint 1901.10334, 2019.
- 8[8] Haldun Aytug. Feature selection for support vector machines using generalized Benders decomposition. European Journal of Operational Research , 244(1):210–218, 2015.
