A discrete version of CMA-ES

Eric Benhamou; Jamal Atif; Rida Laraki

arXiv:1812.11859·cs.LG·February 13, 2019


TL;DR

This paper introduces a discrete version of CMA-ES, extending the algorithm to handle multivariate binomial distributions for optimizing discrete variables, which was previously limited to continuous variables.

Contribution

The authors develop a novel discrete CMA-ES variant using multivariate binomial distributions, capable of modeling higher-order interactions for discrete optimization tasks.

Findings

01

The discrete CMA-ES models correlations efficiently through variable interactions.

02

The distribution can estimate pairwise and higher-order interactions.

03

The paper provides a complete algorithm for the discrete CMA-ES.


Equations

$P(X=x)=p^{x}(1-p)^{1-x},\quad x\in\{0,1\}$

$P(X=x)$

$\theta_{1}$

$\theta_{2}$

$\theta_{12}$

$T(X)=(X_{1},X_{2},X_{1}X_{2})^{T}$

$P(X=x)$

$p_{00}=\frac{1}{1+\exp(\theta_{1})+\exp(\theta_{2})+\exp(\theta_{1}+\theta_{2}+\theta_{12})},$

$p_{10}=\frac{\exp(\theta_{1})}{1+\exp(\theta_{1})+\exp(\theta_{2})+\exp(\theta_{1}+\theta_{2}+\theta_{12})},$

$p_{01}=\frac{\exp(\theta_{2})}{1+\exp(\theta_{1})+\exp(\theta_{2})+\exp(\theta_{1}+\theta_{2}+\theta_{12})},$

$p_{11}=\frac{\exp(\theta_{1}+\theta_{2}+\theta_{12})}{1+\exp(\theta_{1})+\exp(\theta_{2})+\exp(\theta_{1}+\theta_{2}+\theta_{12})}.$

$P(X_{1}=x_{1})=(p_{10}+p_{11})^{x_{1}}(p_{00}+p_{01})^{1-x_{1}}.$

$\mathbb{P}(X_{1}=x_{1}\mid X_{2}=x_{2})=\left(\frac{p_{1x_{2}}}{p_{1x_{2}}+p_{0x_{2}}}\right)^{x_{1}}\left(\frac{p_{0x_{2}}}{p_{1x_{2}}+p_{0x_{2}}}\right)^{1-x_{1}}$

$\binom{n}{k}p^{k}(1-p)^{n-k}\simeq\frac{1}{\sqrt{2\pi np(1-p)}}\,e^{-\frac{(k-np)^{2}}{2np(1-p)}}$

$p_{\theta}(x)=\exp(\langle\theta,\Phi(x)\rangle-A(\theta)),$

$A(\theta)=\log\int\exp(\langle\theta,\Phi(x)\rangle)\,d\mu(x)$

$p_{\theta}(x)=\exp(\langle\theta,\Phi(x)\rangle-A(\theta)),$

$A(\theta)=\log\sum_{x}\exp(\langle\theta,\Phi(x)\rangle)$

$\exp(\theta_{1}x+\theta_{2}x^{2}-A(\theta))$

$p=\frac{1-\sqrt{1-4\sigma^{2}/n}}{2}$

$(\mu+(\mathcal{B}(n,p)-np))\bmod n$

$p(X=i)=c\,\rho^{i}$

$(1+\mu)\rho+(\mu-(n+1))\rho^{n+1}+(n-\mu)\rho^{n+2}=\mu,$

$P(X_{1}=x_{1},X_{2}=x_{2},\dots,X_{k}=x_{k})=p(0,0,\dots,0)^{\prod_{j=1}^{k}(1-x_{j})}\times p(1,0,\dots,0)^{x_{1}\prod_{j=2}^{k}(1-x_{j})}\times p(0,1,\dots,0)^{(1-x_{1})x_{2}\prod_{j=3}^{k}(1-x_{j})}\times\dots\times p(1,1,\dots,1)^{\prod_{j=1}^{k}x_{j}},$

$P(X)=\exp(\langle\theta,T(X)\rangle-A(\theta))$

$\theta_{i_{1},\dots,i_{l}}$ (expression truncated in extraction; defined with the index sets $\Upsilon^{even}$ and $\Upsilon^{odd}$)

$p(1\text{ for }i_{1},\dots,i_{l},\ \text{rest with }0)=\frac{\exp(S^{i_{1},\dots,i_{l}})}{D}.$

$S^{i_{1},\dots,i_{l}}=\sum_{\{i_{1},\dots,i_{m}\}\in\Upsilon_{\{i_{1},\dots,i_{l}\}}}\theta_{i_{1},\dots,i_{m}}$

$D=\sum_{l=0,\dots,k}\ \sum_{1\leq i_{1}<\dots<i_{l}\leq k}\exp(S^{i_{1},\dots,i_{l}})$

$\theta_{i_{1},\dots,i_{l}}=0\quad\forall\,1\leq i_{1}<\dots<i_{l}\leq k,\ l\geq 2.$


Full text

A discrete version of CMA-ES

Eric Benhamou†,‡, Jamal Atif‡, Rida Laraki‡

† A.I. Square Connect

‡ Lamsade, Universite Paris Dauphine, PSL

Abstract

Modern machine learning uses increasingly advanced optimization techniques to find optimal hyperparameters. Whenever the objective function is non-convex, non-continuous and has potentially multiple local minima, standard gradient descent optimization methods fail. A last resort, and a very different method, is to assume that the optima, not necessarily unique, are distributed according to a distribution and to iteratively adapt this distribution according to tested points. These strategies, which originated in the early 1960s under the name Evolution Strategy (ES), have culminated with CMA-ES (Covariance Matrix Adaptation ES). It relies on a multivariate normal distribution and is considered state of the art for general optimization problems. However, it is far from optimal for discrete variables. In this paper, we extend the method to multivariate correlated binomial distributions. We show that such a distribution shares similar features with the multivariate normal: independence and absence of correlation are equivalent, and correlation is efficiently modeled by interactions between variables. We discuss this distribution in the framework of the exponential family. We prove that the model can estimate not only pairwise interactions between two variables but is also capable of modeling higher order interactions. This allows creating a version of CMA-ES that efficiently accommodates discrete variables. We provide the corresponding algorithm and conclude.

1 Introduction

When facing an optimization problem where the gradient of the objective function is not available or is not smooth, state-of-the-art techniques rely on stochastic, derivative-free algorithms that radically change the point of view of the optimization program. Instead of a deterministic gradient descent, we take a Bayesian point of view: we assume that the optimum is distributed according to a prior statistical distribution and use particles, or random draws, to gradually update this distribution. Among these methods, the covariance matrix adaptation evolution strategy (CMA-ES; e.g., [11, 10]) has emerged as the leading stochastic and derivative-free algorithm for solving continuous optimization problems, i.e., for finding the optimum, denoted by $\mathbf{x}^{*}$, of a real-valued objective function $f$ defined on a subset of a $d$-dimensional space $\mathbb{R}^{d}$. This method generates candidate points $\{\mathbf{x}_{i}\}$, $i\in\{1,2,\dots,\lambda\}$, from a multivariate Gaussian distribution and evaluates their objective function (also called fitness) values $\{f(\mathbf{x}_{i})\}$. As the distribution is characterized by its first two moments, the method updates the mean vector and covariance matrix using the sampled points and their fitness values, $\{(\mathbf{x}_{i},f(\mathbf{x}_{i}))\}$. The algorithm repeats the sampling-evaluation-update procedure (which can be seen as an exploration-exploitation method) until the distribution contracts to a single point or the maximum number of iterations is reached. Convergence is detected by a very small covariance matrix. The variations around the original method that aim to improve convergence investigate various heuristics to update the distribution parameters, a choice that strongly determines the behavior and efficiency of the whole algorithm.

The theoretical foundations of CMA-ES are that, for continuous variables with given first two moments, the maximum entropy distribution is the normal distribution, and that the update is based on maximum-likelihood estimation, which grounds the method in a statistical principle.

A natural question is how to adapt this method to discrete variables. Surprisingly, this has not been done before, as the Gaussian distribution is a continuous distribution that is inappropriate for discrete variables. One needs to change the underlying distribution and also find a way to correlate the marginal distributions, which can be tricky. However, we show in this paper that multivariate binomials are the natural discrete counterpart of Gaussian distributions. Hence we are able to modify CMA-ES to accommodate discrete variables. This is the subject of this paper. In section 2, we introduce the multivariate binomial distribution. Presenting this distribution in the general setting of the exponential family, we can easily derive various properties and connect this distribution to maximum entropy. We also prove that, for the assumed correlation structure, independence and absence of correlation are equivalent, which is also a feature of Gaussian distributions. In section 3, we present the algorithm.

2 Primer on Multivariate Binomials

2.1 Intuition

To build some intuition on multivariate binomials, we start with the simplest case, the two-dimensional Bernoulli, which extends the univariate Bernoulli distribution to two dimensions. A Bernoulli random variable $X$ is a discrete variable that takes the value $1$ with probability $p$ and $0$ otherwise. The usual notation for the probability mass function is

$$P(X=x)=p^{x}(1-p)^{1-x},\quad x\in\{0,1\}.$$

A natural extension is to consider the random vector $X=(X_{1},X_{2})$. It takes values in the Cartesian product space $\{0,1\}^{2}=\{0,1\}\times\{0,1\}$. If we denote the joint probabilities by $p_{ij}=P(X_{1}=i,X_{2}=j)$ for $i,j\in\{0,1\}$, then the probability mass function of the bivariate Bernoulli is written as:

$$P(X=x)=P(X_{1}=x_{1},X_{2}=x_{2})=p_{11}^{x_{1}x_{2}}\,p_{10}^{x_{1}(1-x_{2})}\,p_{01}^{(1-x_{1})x_{2}}\,p_{00}^{(1-x_{1})(1-x_{2})},$$

with the side conditions that the joint probabilities lie between $0$ and $1$, that is $0\leq p_{ij}\leq 1$ for $i,j\in\{0,1\}$, and that they sum to one: $p_{00}+p_{10}+p_{01}+p_{11}=1$.

It is however better to write the joint distribution in terms of canonical parameters of the related exponential family. Hence, if we define

$$\theta_{1}=\log\frac{p_{10}}{p_{00}},\qquad\theta_{2}=\log\frac{p_{01}}{p_{00}},\qquad\theta_{12}=\log\frac{p_{11}\,p_{00}}{p_{10}\,p_{01}},$$

and $T(x)$ the vector of sufficient statistics denoted by

$$T(X)=(X_{1},X_{2},X_{1}X_{2})^{T},$$

we can rewrite the distribution as an exponential family distribution as follows:

$$P(X=x)=\exp\left(\langle\theta,T(x)\rangle-A(\theta)\right),$$

where the log partition function $A(\theta)$ is defined so that the probability normalizes to one. It is easy to check that $A(\theta)=-\log p_{00}$. We can also relate the initial moment parameters $\{p_{ij}\}$, for $i,j\in\{0,1\}$, to the canonical parameters as follows.

Proposition 1.

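The moment parameters can be expressed in terms of the canonical parameters as follows:

$$\begin{aligned}
p_{00}&=\frac{1}{1+\exp(\theta_{1})+\exp(\theta_{2})+\exp(\theta_{1}+\theta_{2}+\theta_{12})},\\
p_{10}&=\frac{\exp(\theta_{1})}{1+\exp(\theta_{1})+\exp(\theta_{2})+\exp(\theta_{1}+\theta_{2}+\theta_{12})},\\
p_{01}&=\frac{\exp(\theta_{2})}{1+\exp(\theta_{1})+\exp(\theta_{2})+\exp(\theta_{1}+\theta_{2}+\theta_{12})},\\
p_{11}&=\frac{\exp(\theta_{1}+\theta_{2}+\theta_{12})}{1+\exp(\theta_{1})+\exp(\theta_{2})+\exp(\theta_{1}+\theta_{2}+\theta_{12})}.
\end{aligned}$$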

Proof.

See A.1 ∎
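As a quick numerical illustration of Proposition 1, the following minimal sketch maps canonical parameters to moment parameters and back. The function names are hypothetical, and the inverse map assumes $\theta_{1}=\log(p_{10}/p_{00})$, $\theta_{2}=\log(p_{01}/p_{00})$ and $\theta_{12}=\log(p_{11}p_{00}/(p_{10}p_{01}))$, which follows from the four expressions of the proposition.

```python
import math

def moments_from_canonical(theta1, theta2, theta12):
    """Return (p00, p10, p01, p11) from the canonical parameters."""
    weights = (
        1.0,                                  # x = (0, 0)
        math.exp(theta1),                     # x = (1, 0)
        math.exp(theta2),                     # x = (0, 1)
        math.exp(theta1 + theta2 + theta12),  # x = (1, 1)
    )
    z = sum(weights)  # z = exp(A(theta)) = 1 / p00
    return tuple(w / z for w in weights)

def canonical_from_moments(p00, p10, p01, p11):
    """Inverse map; assumes all four probabilities are strictly positive."""
    return (
        math.log(p10 / p00),
        math.log(p01 / p00),
        math.log(p11 * p00 / (p10 * p01)),
    )

p = moments_from_canonical(0.3, -0.5, 1.2)
theta = canonical_from_moments(*p)  # recovers (0.3, -0.5, 1.2) up to rounding
```

The round trip only works when every $p_{ij}$ is strictly positive, since the canonical parameters are logarithms of probability ratios.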

The expression of the distribution in terms of the canonical parameters is particularly useful as it shows immediately that independence and absence of correlation are equivalent, as in a Gaussian distribution. We will see in subsection 2.3 that this result generalizes to the multivariate binomial.

Proposition 2.

The components of the bivariate Bernoulli random vector $(X_{1},X_{2})$ are independent if and only if $\theta_{12}$ is zero. As for a normal distribution, independence and absence of correlation are equivalent.

Proof.

See A.2 ∎
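Proposition 2 can be checked numerically. The sketch below (hypothetical helper names) computes $\mathrm{Cov}(X_{1},X_{2})=p_{11}-(p_{10}+p_{11})(p_{01}+p_{11})$, which equals $p_{11}p_{00}-p_{10}p_{01}$ because the four probabilities sum to one, and shows that it vanishes when $\theta_{12}=0$ and takes the sign of $\theta_{12}$ otherwise.

```python
import math

def bivariate_probs(theta1, theta2, theta12):
    """Return [p00, p10, p01, p11] for the bivariate Bernoulli."""
    w = [1.0, math.exp(theta1), math.exp(theta2),
         math.exp(theta1 + theta2 + theta12)]
    z = sum(w)
    return [v / z for v in w]

def covariance(theta1, theta2, theta12):
    """Cov(X1, X2) = E[X1 X2] - E[X1] E[X2]."""
    p00, p10, p01, p11 = bivariate_probs(theta1, theta2, theta12)
    return p11 - (p10 + p11) * (p01 + p11)

cov_indep = covariance(0.7, -0.4, 0.0)   # zero up to rounding
cov_pos = covariance(0.7, -0.4, 1.5)     # positive interaction
cov_neg = covariance(0.7, -0.4, -1.5)    # negative interaction
```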

The equivalence between absence of correlation and independence was already presented in [20], where it is referred to as Proposition 2.4.1. The importance of $\theta_{12}$ (referred to as the cross term or u-term) is discussed there, and it is called the cross-product ratio between $X_{1}$ and $X_{2}$. In [16] and [15], this cross-product ratio is also identified but called the log odds.

Intuitively, there are similarities between the Bernoulli (and its sum version, the binomial) and the Gaussian. As for the multivariate Gaussian, we can prove that the marginal and conditional distributions of the bivariate Bernoulli are still Bernoulli, as shown by proposition 3, making the analogy between the Bernoulli (and soon its independent sum version, the binomial) and the Gaussian even more striking!

Proposition 3.

In the bivariate Bernoulli vector whose probability mass function is given by 1

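• the marginal distribution of $X_{1}$ is also a univariate Bernoulli whose probability mass function is

$$P(X_{1}=x_{1})=(p_{10}+p_{11})^{x_{1}}(p_{00}+p_{01})^{1-x_{1}};$$

• the conditional distribution of $X_{1}$ given $X_{2}$ is also a univariate Bernoulli whose probability mass function is

$$\mathbb{P}(X_{1}=x_{1}\mid X_{2}=x_{2})=\left(\frac{p_{1x_{2}}}{p_{1x_{2}}+p_{0x_{2}}}\right)^{x_{1}}\left(\frac{p_{0x_{2}}}{p_{1x_{2}}+p_{0x_{2}}}\right)^{1-x_{1}}.$$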

Proof.

See A.3 ∎

Before we move on to the generalization, we need to mention a few important facts. First, recall that the sum of $n$ independent Bernoulli trials with parameter $p$ is a binomial with parameters $n$ and $p$. Independent sums should immediately bring the Central Limit Theorem to mind. This intuition is correct and shows the connection between the binomial and Gaussian distributions. This is the de Moivre-Laplace theorem, stated below.

Theorem 1.

If the variable $B_{n}$ follows a binomial distribution with parameters $n$ and $p$ in $]0,1[$, then the variable $Z_{n}=\frac{B_{n}-np}{\sqrt{np(1-p)}}=\sqrt{n}\,\frac{B_{n}/n-p}{\sqrt{p(1-p)}}$ converges in law to a standard normal law $\mathcal{N}(0,1)$. Another presentation of this result is to say that, for $p\in]0,1[$, as $n$ grows large, for $k$ in the neighborhood of $np$ we can approximate the binomial distribution by a normal as follows:

$$\binom{n}{k}p^{k}(1-p)^{n-k}\simeq\frac{1}{\sqrt{2\pi np(1-p)}}\,e^{-\frac{(k-np)^{2}}{2np(1-p)}}.$$

The proof of this theorem is traditionally done with a Taylor expansion of the characteristic function. An alternative proof uses the Stirling formula together with a Taylor expansion to relate the binomial distribution to the normal one. Historically, de Moivre was the first to establish this theorem in 1733 in the particular case $p=1/2$. Laplace generalized it in 1812 to any value of $p$ between 0 and 1 and started laying the ground for the central limit theorem, which extended this result far beyond. Later on, many more mathematicians generalized and extended this result, such as Cauchy, Bessel and Poisson, but also von Mises, Pólya, Lindeberg, Lévy and Cramér, as well as Chebyshev, Markov and Lyapunov.
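The local approximation of the de Moivre-Laplace theorem can be checked numerically. The sketch below compares the exact binomial probability mass with the normal density of mean $np$ and variance $np(1-p)$ at $k=np$; the values $n=400$ and $p=0.3$ are arbitrary illustrative choices.

```python
import math

def binomial_pmf(n, k, p):
    """Exact binomial probability mass at k."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def normal_approx(n, k, p):
    """Normal density with mean n*p and variance n*p*(1-p), evaluated at k."""
    var = n * p * (1 - p)
    return math.exp(-((k - n * p) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

n, p = 400, 0.3
k = 120  # k = n * p, the center of the distribution
exact = binomial_pmf(n, k, p)
approx = normal_approx(n, k, p)
rel_err = abs(exact - approx) / exact  # small for large n near k = n*p
```

The agreement degrades for $k$ far in the tails, which is consistent with the theorem being stated for $k$ in the neighborhood of $np$.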

Second, if we take the limit of an infinite sum of Bernoulli trials, we obtain a discrete distribution that is the asymptotic limit of the binomial distribution. This distribution, the Poisson distribution, is also part of the exponential family.

Proposition 4.

For a large number $n$ of independent Bernoulli trials with probability $p$ such that $\lim\limits_{n\to\infty}np=\lambda$, the corresponding binomial distribution with parameters $n$ and $p$ converges in distribution to the Poisson distribution with parameter $\lambda$.

Proof.

See A.4 ∎
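The convergence stated in Proposition 4 can be observed numerically. In this sketch (illustrative values, $\lambda=3$), the largest pointwise gap between the binomial $\mathcal{B}(n,\lambda/n)$ and the Poisson probability mass over $k=0,\dots,10$ shrinks as $n$ grows.

```python
import math

def binomial_pmf(n, k, p):
    """Exact binomial probability mass at k."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Poisson probability mass at k."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 3.0
errors = [
    max(abs(binomial_pmf(n, k, lam / n) - poisson_pmf(k, lam)) for k in range(11))
    for n in (10, 100, 1000, 10000)
]  # one maximum gap per value of n, decreasing as n grows
```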

The two previous results show that binomials, Poisson and Gaussian distributions that are part of the exponential family are closely connected and represent the discrete and continuous version of very similar concepts, namely independent and identically distributed increments.

2.2 Maximum entropy

It is also interesting to relate these distributions to maximum Shannon entropy. Consider a function $\Phi:\Xi\to\mathbb{R}^{d}$, where $\Xi$ is the space of the random variable $X$, and a vector $\alpha\in\mathbb{R}^{d}$. It is well known that the maximum entropy distribution under the constraint $\mathbb{E}_{P}\left[\Phi(X)\right]=\alpha$ belongs to the exponential family, as stated by the following theorem.

Theorem 2.

The distribution that maximizes the Shannon entropy $-\int p(x)\log p(x)d\mu(x)$ subject to the constraint $\mathbb{E}_{P}\left[\Phi(X)\right]=\alpha$ and the obvious probability constraints $\int p(x)d\mu(x)=1$, $p(x)\geq 0$, is the unique distribution that is part of the exponential family and given by

$$p_{\theta}(x)=\exp\left(\langle\theta,\Phi(x)\rangle-A(\theta)\right),$$

with

$$A(\theta)=\log\int\exp\left(\langle\theta,\Phi(x)\rangle\right)d\mu(x).$$

Proof.

See A.5 ∎

Remark 2.1.

Theorem 2 also holds for discrete distributions. It states that the discrete distribution that maximizes the Shannon entropy $-\sum p(x)\log p(x)$ subject to the constraint $\mathbb{E}_{P}\left[\Phi(X)\right]=\alpha$ and the obvious probability constraints $\sum p(x)=1$, $p(x)\geq 0$, is the unique distribution that is part of the exponential family and given by

$$p_{\theta}(x)=\exp\left(\langle\theta,\Phi(x)\rangle-A(\theta)\right),$$

with

$$A(\theta)=\log\sum_{x}\exp\left(\langle\theta,\Phi(x)\rangle\right).$$

Theorem 2 implies in particular that the continuous distribution that maximizes entropy with given mean and variance (or equivalently first and second moments) is an exponential family of the form

$$\exp\left(\theta_{1}x+\theta_{2}x^{2}-A(\theta)\right),$$

where the log partition function $A(\theta)$ is defined to ensure the probability distribution integrates to one. This distribution is indeed a normal distribution, as it is the exponential of a quadratic form. This is precisely the continuous distribution used in the CMA ES algorithm. Taking a distribution that maximizes the entropy means that we take the least informative prior. Said differently, this is the distribution with the minimal prior structural constraint. If nothing is known, it should therefore be preferred.

Ideally, for our discrete adaptation of CMA ES, we would like to find the discrete equivalent of the normal distribution. If we want a discrete distribution with independent increments, we should turn to binomial distributions. Binomials have the additional advantage of converging to the normal distribution whenever the discrete parameter converges to a continuous one. Binomials are also part of the exponential family. But we face various problems. To keep the discussion simple, let us first look at a single parameter that can take as values all the integers from $0$ to $n$.

First of all, we need to control the first two moments of our distribution, or equivalently the mean, denoted by $\mu$, and the variance, denoted by $\sigma^{2}$. Unlike normal distributions, binomial distributions do not have two free parameters to match first and second moment constraints. Indeed, for a given $n$, the number of discrete states of the parameter to optimize in the discrete CMA ES, we are left with a single parameter $p$ for the binomial distribution $\mathcal{B}(n,p)$ to accommodate the constraints. The expectation is $np$ while the variance is $np(1-p)$. If we want a discrete distribution that progressively peaks at the minimum, we need to be able to force the variance to converge to $0$. We therefore fix the variance to $\sigma^{2}=np(1-p)$, solve the quadratic equation $p^{2}-p+\sigma^{2}/n=0$ and use the smaller root, given by

$$p=\frac{1-\sqrt{1-4\sigma^{2}/n}}{2},$$

provided that $\sigma^{2}\leq n/4$. As $\sigma$ tends to zero, the parameter $p$ tends to zero. To accommodate the mean constraint, we need a work-around: if we do nothing, the discrete parameter converges to $0$ as $p$ converges to $0$. A simple solution is to assume that our discrete parameter is distributed according to

$$(\mu+(\mathcal{B}(n,p)-np))\bmod n,$$

where $a\bmod n$ denotes $a$ modulo $n$, that is, the remainder of the Euclidean division of $a$ by $n$. This method ensures that we sample among the possible values from $0$ to $n$, with a mean equal to $\mu$ and a variance controlled by the parameter $p$.
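The construction above can be sketched as follows. This is a minimal illustration under two assumptions that are not in the text: the centering term $np$ is rounded to the nearest integer so that the draw stays an integer, and the reduction modulo $n$ is taken literally, which returns values in $0,\dots,n-1$. The function names are hypothetical.

```python
import math
import random

def success_prob(sigma2, n):
    """Smaller root of p^2 - p + sigma2/n = 0, so that n*p*(1-p) = sigma2."""
    if not 0.0 <= sigma2 <= n / 4:
        raise ValueError("need 0 <= sigma2 <= n/4")
    return (1.0 - math.sqrt(1.0 - 4.0 * sigma2 / n)) / 2.0

def sample_discrete(mu, sigma2, n, rng):
    """One draw of (mu + (B(n, p) - n*p)) mod n, with n*p rounded (assumption)."""
    p = success_prob(sigma2, n)
    b = sum(rng.random() < p for _ in range(n))  # one draw of B(n, p)
    return (mu + b - round(n * p)) % n

rng = random.Random(0)
n, mu, sigma2 = 20, 7, 2.0
p = success_prob(sigma2, n)
draws = [sample_discrete(mu, sigma2, n, rng) for _ in range(2000)]
```

As $\sigma^{2}$ shrinks, $p$ shrinks with it and the draws concentrate on $\mu$, which is the contraction behavior the text asks for.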

Secondly, we would like to use a discrete distribution that maximizes the entropy. This is the case for the continuous version of CMA-ES with the normal distribution. However, for discrete distributions, this maximum entropy condition is not as easy to satisfy. It is well known that the maximum entropy discrete distribution with a given mean is not the binomial distribution but rather the distribution given by

$$p(X=i)=c\,\rho^{i},$$

where $c=1/{\sum\limits_{i=0}^{n}\rho^{i}}=\frac{1-\rho}{1-\rho^{n+1}}$ and where $\rho$ is determined so that $\sum\limits_{i=0}^{n}c\,\,i\,\,\rho^{i}=\mu$, which leads to the implicit equation for $\rho$:

$$(1+\mu)\rho+(\mu-(n+1))\rho^{n+1}+(n-\mu)\rho^{n+2}=\mu,$$

using the well known geometric identities $\sum\limits_{i=0}^{n}\rho^{i}=\frac{1-\rho^{n+1}}{1-\rho}$ and $\sum\limits_{i=0}^{n}i\rho^{i}=\frac{\rho\frac{1-\rho^{n+1}}{1-\rho}-(n+1)\rho^{n+1}}{1-\rho}$. This distribution is sometimes referred to as the truncated geometric distribution. It is not our desired binomial distribution. We can rule out the truncated geometric distribution, as its probability mass function does not make sense for our parameter: it is monotonic in $i$ (decreasing when $\rho<1$), which is not a desirable feature. Rather, we would like a bell shape, which is the case for the binomial distribution. The tricky question is how to relate the binomial distribution to a maximum entropy principle, as is the case for the normal.
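The implicit equation for $\rho$ has no closed form in general, but it can be solved numerically. The sketch below is one possible approach, not taken from the paper: it bisects on the mean of the truncated geometric distribution, which increases from $0$ to $n$ as $\rho$ goes from $0$ to infinity. Working on the mean rather than on the polynomial avoids the root $\rho=1$, which satisfies the implicit equation for every $\mu$. The function names and the bracketing interval are assumptions.

```python
import math

def truncated_geometric_mean(rho, n):
    """Mean of p(X = i) = c * rho**i on {0, ..., n}."""
    weights = [rho**i for i in range(n + 1)]
    return sum(i * w for i, w in enumerate(weights)) / sum(weights)

def solve_rho(mu, n, iterations=200):
    """Bisection in log scale on the mean; assumes 0 < mu < n."""
    if not 0 < mu < n:
        raise ValueError("need 0 < mu < n")
    lo, hi = 1e-9, 1e9
    for _ in range(iterations):
        mid = math.sqrt(lo * hi)  # geometric midpoint
        if truncated_geometric_mean(mid, n) < mu:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

n, mu = 10, 3.0
rho = solve_rho(mu, n)
# Residual of the implicit equation, close to zero at the solution.
residual = (1 + mu) * rho + (mu - (n + 1)) * rho ** (n + 1) \
    + (n - mu) * rho ** (n + 2) - mu
```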

We can first remark that the binomial distribution is not too far from a geometric distribution when the number of trials $n$ tends to infinity, at least for some terms. Indeed, the probability mass function is given by $\binom{n}{k}p^{k}(1-p)^{n-k}$. Using the Stirling formula, we can approximate the factorial of $n$ for large $n$ as $n\,!\sim\sqrt{2\pi n}\,\left(\frac{n}{e}\right)^{n}$, which leads to an asymptotic term similar to the geometric distribution. This gives some hope that there should be a way to relate our binomial distribution to a maximum entropy principle. The trick here is to reduce the space of possible distributions. If, instead of looking at the entire space of distributions, we restrict ourselves to Poisson binomial distributions (also referred to in the statistics literature as generalized binomial distributions), we can find a solution. The Poisson binomial distribution, named after the famous French mathematician Siméon Denis Poisson, is the discrete probability distribution of a sum of independent Bernoulli trials that are not necessarily identically distributed. Nicely, when the space of possible distributions is restricted to Poisson binomial distributions, theorem 3 proves that the binomial distribution maximizes the entropy for a given mean.

Theorem 3.

Among all Poisson binomial distributions with $n$ trials, the distribution that maximizes the Shannon entropy $-\sum p(x)\log p(x)$ subject to the constraint $\sum xp(x)=\mu$ and the obvious probability constraints is the binomial distribution $\mathcal{B}(n,p)$ such that $np=\mu$.

Proof.

See A.6 ∎
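Theorem 3 can be illustrated on a small example. The sketch below builds the probability mass function of a Poisson binomial by convolving its Bernoulli factors one at a time, then compares the entropy of an uneven set of success probabilities with that of the binomial having the same mean. The probabilities are arbitrary illustrative values.

```python
import math

def poisson_binomial_pmf(probs):
    """PMF of a sum of independent Bernoulli(p_i), built by convolution."""
    pmf = [1.0]
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1 - p)   # this trial fails
            nxt[k + 1] += mass * p     # this trial succeeds
        pmf = nxt
    return pmf

def entropy(pmf):
    """Shannon entropy in nats."""
    return -sum(q * math.log(q) for q in pmf if q > 0)

uneven = [0.1, 0.2, 0.6, 0.7]
mean_p = sum(uneven) / len(uneven)  # same mean for both distributions
h_uneven = entropy(poisson_binomial_pmf(uneven))
h_binomial = entropy(poisson_binomial_pmf([mean_p] * len(uneven)))
```

Here `h_binomial` exceeds `h_uneven`, in line with the theorem: averaging the success probabilities raises the entropy.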

2.3 Multivariate and Correlated Binomials

Equipped with the intuition of the first section, we can see the profound connection between multivariate normal and multivariate binomial. We will define our multivariate binomial as the sum for $n$ independent trials of multivariate Bernoulli defined as before.

Let $X=(X_{1},\ldots,X_{k})$ be a k-dimensional random vector of possibly correlated binomial random variables that may have different parameters $n_{i}$ and $p_{i}$, and let $x=(x_{1},\ldots,x_{k})$ be a realization of $X$. The joint probability is given naturally by

$$\begin{aligned}
P(X_{1}=x_{1},X_{2}=x_{2},\dots,X_{k}=x_{k})={}&p(0,0,\dots,0)^{\prod_{j=1}^{k}(1-x_{j})}\times p(1,0,\dots,0)^{x_{1}\prod_{j=2}^{k}(1-x_{j})}\\
&\times p(0,1,\dots,0)^{(1-x_{1})x_{2}\prod_{j=3}^{k}(1-x_{j})}\times\dots\times p(1,1,\dots,1)^{\prod_{j=1}^{k}x_{j}}.
\end{aligned}$$

As in the simple case of section 2.1, we can rewrite this joint probability in the exponential form. Let us introduce some notation.

Let $T(X)$ be the vector $(X_{1},...,X_{k},$ $X_{1}X_{2},\ldots,X_{1}\ldots X_{k})^{T}$ of size $2^{k}-1$ whose elements represent all the possible selections of $1$ to $k$ variables among $X_{1},\ldots,X_{k}$. These selections are all the possible monomials in $X_{1},\ldots,X_{k}$ of degree $1$ to $k$, where by monomial we mean a product of distinct variables, each with an exponent equal to 0 or 1. We also denote by $(i_{1},\ldots,i_{l})$ an ordered set of $1\leq l\leq k$ elements of the integers from $1$ to $k$ and by $\Upsilon_{\{1,\ldots,k\}}$ the set of all such ordered sets. Equivalently, $\Upsilon_{\{1,\ldots,k\}}$ is the set of all non-empty subsets of $\{1,\ldots,k\}$. Similarly, $\Upsilon_{\{i_{1},\ldots,i_{l}\}}$ is the set of all non-empty subsets of $\{i_{1},\ldots,i_{l}\}$. Finally, $\Upsilon_{\{i_{1},\ldots,i_{l}\}}^{even}$ (respectively $\Upsilon_{\{i_{1},\ldots,i_{l}\}}^{odd}$) is the subset of $\Upsilon_{\{i_{1},\ldots,i_{l}\}}$ containing the sets whose cardinality is even (respectively odd).

We are now able to provide the following proposition that gives the exponential form of the multi variate binomial mass probability function:

Proposition 5 (Exponential form).

The multivariate Bernoulli model has a probability mass function of the exponential form given by

$$P(X)=\exp\left(\langle\theta,T(X)\rangle-A(\theta)\right),$$

where the sufficient statistic $T(X)$ is $T(X)=(X_{1},...,X_{k},X_{1}X_{2},\ldots,X_{1}\ldots X_{k})$ , the log partition function $A(\theta)$ is $A(\theta)=-\log p(0,0,\ldots,0)$ and the coefficients $\theta$ are given by:

[TABLE]

Similarly, we can compute the regular probabilities from the canonical parameters as follows:

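$$p(1\text{ for }i_{1},\dots,i_{l},\ \text{rest with }0)=\frac{\exp(S^{i_{1},\dots,i_{l}})}{D},$$

where $S^{i_{1},\ldots,i_{l}}$ is the sum of all the theta parameters indexed by any non empty selection within $\{i_{1},\ldots,i_{l}\}$:

$$S^{i_{1},\dots,i_{l}}=\sum_{\{i_{1},\dots,i_{m}\}\in\Upsilon_{\{i_{1},\dots,i_{l}\}}}\theta_{i_{1},\dots,i_{m}},$$

with the convention $S=0$ for the empty set, and $D$ is the normalizing constant such that all the probabilities sum to 1:

$$D=\sum_{l=0,\dots,k}\ \sum_{1\leq i_{1}<\dots<i_{l}\leq k}\exp(S^{i_{1},\dots,i_{l}}),$$

with the convention for $l=0$ that $\exp(S^{i_{1},\ldots,i_{l}})=1$.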

Proof.

See A.7 ∎
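The map from canonical parameters to pattern probabilities in Proposition 5 can be sketched in a few lines. In this illustration the parameters are stored in a dictionary keyed by index sets (a hypothetical representation, with indices starting at 0), missing keys are treated as zero, and each pattern is identified by the set of coordinates equal to 1.

```python
import math
from itertools import combinations

def nonempty_subsets(indices):
    """All non-empty subsets of an index collection, as frozensets."""
    idx = sorted(indices)
    for size in range(1, len(idx) + 1):
        for combo in combinations(idx, size):
            yield frozenset(combo)

def pattern_probabilities(theta, k):
    """Map each set A of 'ones' to p(A) = exp(S^A) / D."""
    patterns = [frozenset()] + list(nonempty_subsets(range(k)))
    # S^A sums theta over the non-empty subsets of A; S of the empty set is 0.
    s = {a: sum(theta.get(b, 0.0) for b in nonempty_subsets(a)) for a in patterns}
    d = sum(math.exp(v) for v in s.values())  # normalizing constant D
    return {a: math.exp(v) / d for a, v in s.items()}

theta = {
    frozenset({0}): 0.2,
    frozenset({1}): -0.3,
    frozenset({2}): 0.5,
    frozenset({0, 1}): 0.8,  # only X0 and X1 interact
}
probs = pattern_probabilities(theta, 3)
```

With this choice of parameters, the third coordinate has no interaction term, so the probabilities factorize along it, while the first two coordinates remain dependent.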

Last but not least, we can extend the result found for the simple two-dimensional Bernoulli variable to the general multi-dimensional Bernoulli concerning independence and correlation. Recall that one of the important statistical properties of the multivariate Gaussian distribution is the equivalence of independence and absence of correlation. This is a remarkable property of the Gaussian (although more could be said about independence and Gaussianity, as explained for instance in [6]).

The independence of a random vector is determined by the separability of coordinates in its probability mass function. If we use the natural (or moment) parameter form of the probability mass function, this is not obvious. However, using the exponential form, the result is almost trivial and is given by the following proposition

Proposition 6 (Independence of Bernoulli outcomes).

The multivariate Bernoulli variable $X=(X_{1},\ldots,X_{k})$ is independent element-wise if and only if

$$\theta_{i_{1},\dots,i_{l}}=0\quad\forall\,1\leq i_{1}<\dots<i_{l}\leq k,\ l\geq 2.$$

Proof.

See A.8 ∎

Remark 2.2.

The condition of equivalence between independence and no correlation can also be rewritten as

[TABLE]

Remark 2.3.

A general multivariate binomial model involves $2^{k}-1$ parameters, which is far too many when $k$ is large. A simpler model is to impose that only the canonical parameters involving one state $X_{i}$ or two states $X_{i},X_{j}$ are non zero. This is in fact the Ising model.
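To make the saving concrete, the following small sketch counts the canonical parameters of the full model and of the pairwise restriction, assuming $k$ denotes the number of variables as in the definition of $T(X)$. The function names are hypothetical.

```python
def full_model_parameters(k):
    """One parameter per non-empty subset of the k variables."""
    return 2**k - 1

def pairwise_model_parameters(k):
    """k single-variable terms plus one term per unordered pair."""
    return k + k * (k - 1) // 2

counts = {k: (full_model_parameters(k), pairwise_model_parameters(k))
          for k in (2, 5, 10, 20)}
```

For two variables both counts coincide at 3, which matches the bivariate case $(\theta_{1},\theta_{2},\theta_{12})$; the gap then grows exponentially with $k$.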

3 Algorithm

3.0.1 CMA-ES estimation

Another radically different approach is to minimize some cost function depending on the Kalman filter parameters. As opposed to the maximum likelihood approach, which tries to find the distribution that best fits the data, this approach can factor in some noise and directly target a cost function that is our final result. Because our model is an approximation of reality, this introduction of noise may lead to a better overall cost function but a worse distribution in terms of fit to the data.

Let us first introduce the CMA-ES algorithm. Its name stands for covariance matrix adaptation evolution strategy. As the name points out, it is an evolution strategy optimization method, meaning that it is a derivative-free method that can accommodate non-convex optimization problems. The terminology covariance matrix alludes to the fact that the exploration of new points is based on a multivariate normal distribution whose covariance matrix is progressively determined at each iteration. Hence the covariance matrix adapts in a sense to the sampling space: it contracts in dimensions that are useless and expands in dimensions where the natural gradient is steep. This algorithm has led to a large number of papers and articles, and we refer to [19], [17], [2], [1], [9], [4], [8], [3], [14], [5] to cite a few of the numerous articles around CMA-ES. We also refer the reader to the excellent Wikipedia page [21].

In order to adapt CMA ES to discrete variables, we replace in the algorithm the generation of Gaussian variables by the generation of multivariate binomials as follows:

[TABLE]

The corresponding algorithm is given in Algorithm 1.
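Since the listing of the algorithm is not reproduced in this text, the following is only a rough sketch of the sampling-evaluation-update loop on binary variables, and it should not be read as the authors' Algorithm 1. It makes strong simplifying assumptions: all interaction parameters are set to zero (independent Bernoulli marginals), the update moves each marginal towards its frequency among the best samples (in the spirit of univariate estimation-of-distribution methods), and the toy objective is maximized. All names and default values are hypothetical.

```python
import random

def sample_bits(probs, rng):
    """Draw one binary vector with independent Bernoulli marginals."""
    return [1 if rng.random() < q else 0 for q in probs]

def independent_bernoulli_search(fitness, dim, iterations=30, population=50,
                                 elite=10, learning_rate=0.5, seed=0):
    """Maximize `fitness` over {0, 1}^dim with a sample-evaluate-update loop."""
    rng = random.Random(seed)
    probs = [0.5] * dim                      # initial marginals
    best_x, best_f = None, float("-inf")
    for _ in range(iterations):
        samples = [sample_bits(probs, rng) for _ in range(population)]
        samples.sort(key=fitness, reverse=True)          # best first
        if fitness(samples[0]) > best_f:
            best_x, best_f = samples[0], fitness(samples[0])
        for j in range(dim):
            freq = sum(x[j] for x in samples[:elite]) / elite
            q = (1 - learning_rate) * probs[j] + learning_rate * freq
            probs[j] = min(0.95, max(0.05, q))           # keep some exploration
    return best_x, best_f

# Toy objective: number of ones in the vector.
best_x, best_f = independent_bernoulli_search(sum, dim=8)
```

A version closer to the paper would replace the independent marginals by the pairwise (Ising-type) model of Remark 2.3 and update the interaction parameters as well.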

4 Conclusion

In this paper, we showed that, using multivariate correlated binomial distributions, we can derive an efficient adaptation of CMA-ES for discrete variable optimization problems. We have proved that correlated binomials share some similarities with the normal distribution in terms of the equivalence between independence and absence of correlation, as well as rich information for the correlation structure. In order to avoid too many parameters, we impose that only single state and bi-state probabilities are not null. In the future, we hope to develop additional variations around this CMA-ES version for the combination of discrete and continuous variables, potentially mixing multivariate binomial and normal distributions.

Appendix A Proofs

A.1 Proof of proposition 1

Proof.

We can trivially infer all the moment parameters from equations 2, 3 and 4. ∎

A.2 Proof of proposition 2

Proof.

The exponential family formulation of the bivariate Bernoulli distribution shows that a necessary and sufficient condition for the distribution to be separable into two components, each depending only on $x_{1}$ and $x_{2}$ respectively, is that $\theta_{12}=0$. This proves the first assertion of proposition 2.

Proving equivalence between correlation and independence is the same as proving equivalence between covariance and independence. The covariance between $X_{1}$ and $X_{2}$ is easy to calculate and given by

[TABLE]

where in equation 29, we have used that the four probabilities sum to one. Hence, the correlation or the covariance is null for non trivial probabilities if and only if $\theta_{12}=0$ , which is equivalent to the independence. ∎

A.3 Proof of proposition 3

Proof.

For the coordinate $X_{1}$ , it is trivial to see that

[TABLE]

which shows that $X_{1}$ follows the univariate Bernoulli distribution with density given by equation (11). Likewise, it is trivial to see that

[TABLE]

Similar results apply for the condition $X_{2}=1$ , which shows the second result and concludes the proof. ∎

A.4 Proof of proposition 4

Proof.

Let us write the limit for the binomial distribution when number of trials $n\to\infty$ , and probability of success in trial $p\to 0$ but $np\to\lambda$ remains finite. We have for a given $k$

[TABLE]

which proves that the binomial converges to the Poisson distribution. ∎

A.5 Proof of theorem 2

Proof.

We follow the proof of Theorem 11.1.1 of [7]. If we write the Lagrangian $\mathcal{L}(p,\theta,\theta_{0},\lambda)$ for the problem

[TABLE]

where the Lagrange multipliers $(\theta,\theta_{0},\lambda)$ are for the three constraints, we have

[TABLE]

and noticing that the function to optimize is convex and that the problem satisfies Slater’s condition, we can use Lagrange duality to characterize the solution as the critical point given by

[TABLE]

or equivalently,

[TABLE]

As this solution always satisfies the condition $p(x)>0$, the Lagrange multiplier related to the constraint $p(x)>0$ must be null: $\lambda(x)=0$. The solution must be a probability distribution, which implies that

[TABLE]

which imposes that

[TABLE]

or equivalently, writing in the exponential form $\theta_{0}-1=A(\theta)=\log\int\exp(\langle\theta,\phi(x)\rangle)\,d\mu(x)$, we have that $p$ satisfies

[TABLE]

which shows that the distribution is part of the exponential family.

To prove its uniqueness, we use the fact that the Shannon entropy is related to the Kullback-Leibler divergence $D_{kl}$ as follows:

[TABLE]

which concludes the proof, as the Kullback-Leibler divergence $D_{kl}(P\|P_{\theta_{0}})>0$ unless $P=P_{\theta_{0}}$. ∎
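The relation between entropy and Kullback-Leibler divergence used in the uniqueness argument can be checked on a small discrete example. The sketch below rests on assumptions made only for illustration: a four-point support, the single statistic $\phi(x)=x$, the value $\theta=0.5$, and a perturbation direction chosen so that the perturbed distribution keeps the same total mass and the same mean. For any such distribution $q$, the entropy gap $H(p^{*})-H(q)$ should equal $D_{kl}(q\|p^{*})$.

```python
import math

def entropy(p):
    """Shannon entropy (natural logarithm) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_divergence(q, p):
    """Kullback-Leibler divergence D(q || p) for discrete distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

support = [0, 1, 2, 3]  # illustrative support
theta = 0.5             # illustrative natural parameter

# Exponential-family candidate p*(x) proportional to exp(theta * x).
weights = [math.exp(theta * x) for x in support]
normalizer = sum(weights)
p_star = [w / normalizer for w in weights]

# Direction with zero total mass and zero first moment, so that q
# satisfies the same normalization and mean constraints as p*.
direction = [1.0, -2.0, 1.0, 0.0]
eps = 0.05
q = [pi + eps * d for pi, d in zip(p_star, direction)]

mean_star = sum(x * pi for x, pi in zip(support, p_star))
mean_q = sum(x * qi for x, qi in zip(support, q))

entropy_gap = entropy(p_star) - entropy(q)
divergence = kl_divergence(q, p_star)
```

Here `entropy_gap` and `divergence` agree up to rounding and are strictly positive, so the feasible competitor $q$ has strictly lower entropy than the exponential-family distribution.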

A.6 Proof of theorem 3

Proof.

We will prove the stronger result that the entropy $H(p_{1},\ldots,p_{n})$ of a generalized binomial distribution with parameters $n,p_{1},\ldots,p_{n}$ is Schur concave (see [18] for a definition and some properties). A direct consequence of Schur concavity is that the function is maximal at the constant vector, as follows:

[TABLE]

with $\bar{p}=\frac{\sum_{i=1}^{n}p_{i}}{n}$ . This will prove that the regular binomial distribution satisfies the maximum entropy principle.

Our proof of Schur concavity uses the same trick as in [13], namely elementary symmetric functions. Let $(X_{i})_{i=1,\ldots,n}$ denote independent Bernoulli variables with parameters $p_{i}$, and let $S_{n}=\sum_{i=1}^{n}X_{i}$ be their sum, the canonical Poisson binomial variable. Its probability mass function is

[TABLE]

where $F_{k}$ is the set of all subsets of $k$ integers selected from $\{1,2,3,\ldots,n\}$. The entropy $H(p_{1},\ldots,p_{n})$ is permutation symmetric; hence, to prove Schur concavity, it suffices to show that the cross term $(p_{1}-p_{2})\left(\frac{\partial H}{\partial p_{1}}-\frac{\partial H}{\partial p_{2}}\right)$ is negative. Let us compute

[TABLE]

We can notice that for $k\geq 2$ and $k\leq n-2$

[TABLE]

where $\pi_{j}^{n-2}=\pi_{j}^{n-2}(p_{3},\ldots,p_{n})$. Equation (41) can be extended to $k=0,1$ or $k=n-1,n$ with the convention that $\pi_{j}^{n}=0$ for $j<0$ and for $j>n$. Hence equation (41) is valid for any $k$. Differentiating equation (41) with respect to $p_{i}$ leads to

[TABLE]

Combining equations (40) and (42), we have

[TABLE]

Recall a result on elementary symmetric functions attributed to Newton. Denote the product

[TABLE]

with $c_{k}$ the $k$th elementary symmetric function of the $a$’s. We have

[TABLE]

unless all the $a$’s are equal (see for instance [12], theorem 51, page 52, section 2.22). Let us take the function
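As a numerical illustration, Newton's inequality can be checked in its normalized form $m_{k-1}m_{k+1}<m_{k}^{2}$, where $m_{k}=c_{k}/\binom{n}{k}$. The normalization, the symbol $m_{k}$ and the sample values below are assumptions made for this sketch; the exact form of the display above may differ.

```python
import math

def elementary_symmetric(values):
    """Return [c_0, ..., c_n], the elementary symmetric functions of values."""
    coeffs = [1.0]
    for a in values:
        # Multiply the polynomial prod (1 + a_i t) by (1 + a t).
        updated = coeffs + [0.0]
        for k in range(len(coeffs), 0, -1):
            updated[k] += a * coeffs[k - 1]
        coeffs = updated
    return coeffs

def newton_gaps(values):
    """Return m_k^2 - m_{k-1} m_{k+1} for k = 1..n-1, with m_k = c_k / C(n, k)."""
    n = len(values)
    c = elementary_symmetric(values)
    m = [c[k] / math.comb(n, k) for k in range(n + 1)]
    return [m[k] ** 2 - m[k - 1] * m[k + 1] for k in range(1, n)]

gaps_distinct = newton_gaps([1.0, 2.0, 3.0, 4.0])  # positive, not all equal
gaps_equal = newton_gaps([2.0, 2.0, 2.0])          # all equal: equality case
```

For positive values that are not all equal every gap is strictly positive, and the gaps vanish when all values coincide.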

[TABLE]

we have therefore that

[TABLE]

Combining equations (45) and (48), we proved that

[TABLE]

which concludes the proof. ∎

A.7 Proof of proposition 5
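The statement of theorem 3 can be illustrated numerically by comparing the entropy of a Poisson binomial distribution with that of the binomial distribution sharing the same mean. This is a minimal sketch with illustrative parameters; it checks one instance and is not a substitute for the proof.

```python
import math

def poisson_binomial_pmf(ps):
    """PMF of the sum of independent Bernoulli(p_i) variables, by convolution."""
    pmf = [1.0]
    for p in ps:
        updated = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            updated[k] += mass * (1 - p)   # this trial fails
            updated[k + 1] += mass * p     # this trial succeeds
        pmf = updated
    return pmf

def entropy(pmf):
    """Shannon entropy (natural logarithm)."""
    return -sum(q * math.log(q) for q in pmf if q > 0)

ps = [0.1, 0.5, 0.9]          # illustrative success probabilities
p_bar = sum(ps) / len(ps)     # common probability of the matching binomial

pmf_general = poisson_binomial_pmf(ps)
pmf_binomial = poisson_binomial_pmf([p_bar] * len(ps))

h_general = entropy(pmf_general)
h_binomial = entropy(pmf_binomial)
```

With these values `h_binomial` exceeds `h_general`, as expected when all the $p_{i}$ are replaced by their average $\bar{p}$.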

Proof.

Comparing equations (2.3) and (17), and using the sufficient statistics given in proposition 5, we see by identification that the parameters $\theta$ are given by equation (18), with the positive-sign terms given by equation (19) and the negative-sign terms given by equation (20).

Similarly, taking equations (21), (22) and (23), we notice that the given probabilities sum to one, that $D=1/p_{0,\ldots,0}$, and that the probabilities can be recovered from expression (18) for the $\theta$’s. This concludes the proof. ∎

A.8 Proof of proposition 6
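The round trip between probabilities and natural parameters can be sketched in a few lines. The code below assumes the standard multivariate Bernoulli parametrization, in which the parameter attached to a non-empty index set is an alternating sum of log-probabilities over its subsets, and the sufficient statistics are the products of the corresponding coordinates. Equations (18) to (23) are not reproduced here, so the function names and this exact form are assumptions made for illustration.

```python
import itertools
import math

def natural_parameters(prob):
    """Map joint probabilities {x: p(x)}, x in {0,1}^n, to {tau: theta_tau}."""
    n = len(next(iter(prob)))
    theta = {}
    for tau in itertools.product((0, 1), repeat=n):
        support = [i for i, bit in enumerate(tau) if bit]
        if not support:
            continue  # the empty set is absorbed by the normalizer D
        total = 0.0
        for r in range(len(support) + 1):
            for subset in itertools.combinations(support, r):
                x = tuple(1 if i in subset else 0 for i in range(n))
                # Plus sign when the subset has the same parity as tau.
                sign = (-1) ** (len(support) - r)
                total += sign * math.log(prob[x])
        theta[tau] = total
    return theta

def probabilities(theta, n):
    """Recover joint probabilities and the normalizer D from the thetas."""
    weights = {}
    for x in itertools.product((0, 1), repeat=n):
        # Sum the thetas of all index sets contained in the support of x.
        exponent = sum(
            value for tau, value in theta.items()
            if all(xi >= ti for xi, ti in zip(x, tau))
        )
        weights[x] = math.exp(exponent)
    normalizer = sum(weights.values())
    return {x: w / normalizer for x, w in weights.items()}, normalizer

# Illustrative joint law on {0,1}^3 with strictly positive probabilities.
states = list(itertools.product((0, 1), repeat=3))
prob = {x: (i + 1) / 36.0 for i, x in enumerate(states)}

theta = natural_parameters(prob)
recovered, D = probabilities(theta, 3)
```

In this example `recovered` matches `prob` up to rounding and `D` equals `1 / prob[(0, 0, 0)]`, which agrees with the relation $D=1/p_{0,\ldots,0}$ stated in the proof.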

Proof.

The proof of proposition 6 is immediate using the exponential form, as independence is equivalent to separability, which is in turn equivalent to equation (24). ∎

Appendix B Pseudo code

Bibliography

[1] Youhei Akimoto, Anne Auger, and Nikolaus Hansen. Continuous optimization and CMA-ES. In Genetic and Evolutionary Computation Conference, GECCO 2015, Madrid, Spain, July 11-15, 2015, Companion Material Proceedings, pages 313–344, 2015.
[2] Youhei Akimoto, Anne Auger, and Nikolaus Hansen. CMA-ES and advanced adaptation mechanisms. In Genetic and Evolutionary Computation Conference, GECCO 2016, Denver, CO, USA, July 20-24, 2016, Companion Material Proceedings, pages 533–562, 2016.
[3] Anne Auger and Nikolaus Hansen. Benchmarking the (1+1)-CMA-ES on the BBOB-2009 noisy testbed. In Genetic and Evolutionary Computation Conference, GECCO 2009, Proceedings, Montreal, Québec, Canada, July 8-12, 2009, Companion Material, pages 2467–2472, 2009.
[4] Anne Auger and Nikolaus Hansen. Tutorial CMA-ES: evolution strategies and covariance matrix adaptation. In Genetic and Evolutionary Computation Conference, GECCO ’12, Philadelphia, PA, USA, July 7-11, 2012, Companion Material Proceedings, pages 827–848, 2012.
[5] Anne Auger, Marc Schoenauer, and Nicolas Vanhaecke. LS-CMA-ES: A second-order algorithm for covariance matrix adaptation. In Parallel Problem Solving from Nature - PPSN VIII, 8th International Conference, Birmingham, UK, September 18-22, 2004, Proceedings, pages 182–191, 2004.
[6] Eric Benhamou, Beatrice Guez, and Nicolas Paris. Three remarkable properties of the Normal distribution. arXiv e-prints, October 2018.
[7] T. M. Cover and J. A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, New York, NY, USA, 2006.
[8] Nikolaus Hansen and Anne Auger. CMA-ES: evolution strategies and covariance matrix adaptation. In 13th Annual Genetic and Evolutionary Computation Conference, GECCO 2011, Companion Material Proceedings, Dublin, Ireland, July 12-16, 2011, pages 991–1010, 2011.
