Adaptative Inference Cost With Convolutional Neural Mixture Models

Adria Ruiz; Jakob Verbeek

arXiv:1908.06694·cs.CV·August 20, 2019

TL;DR

This paper introduces Convolutional Neural Mixture Models (CNMMs), a probabilistic framework that efficiently combines multiple CNNs to adapt inference costs dynamically, achieving high accuracy with flexible computational trade-offs.

Contribution

The paper proposes CNMMs, a novel probabilistic model that enables dynamic pruning of CNN subsets for efficient inference without re-training.

Findings

1. Achieves excellent accuracy-compute trade-offs in image classification and segmentation.
2. Provides a wide range of operating points along the accuracy-cost spectrum.
3. Allows inference cost adaptation without re-training.

Abstract

Despite the outstanding performance of convolutional neural networks (CNNs) for many vision tasks, the required computational cost during inference is problematic when resources are limited. In this context, we propose Convolutional Neural Mixture Models (CNMMs), a probabilistic model embedding a large number of CNNs that can be jointly trained and evaluated in an efficient manner. Within the proposed framework, we present different mechanisms to prune subsets of CNNs from the mixture, making it easy to adapt the computational cost required for inference. Image classification and semantic segmentation experiments show that our method achieves excellent accuracy-compute trade-offs. Moreover, unlike most previous approaches, a single CNMM provides a large range of operating points along this trade-off, without any re-training.

Tables

Table 1: Comparison of the results obtained on CIFAR-100 by approximating the CNMM output using our approach or a Monte-Carlo procedure with different numbers of samples $N$ .

Approximation	FLOPs	Top-1 Accuracy
Expectation (used)	93M	74.4
Sampling N=1	93M	71.2
Sampling N=5	463M	74.4
Sampling N=15	1390M	74.5
Sampling N=30	2780M	74.6

Equations

$$\mathcal{F}(\mathbf{H}_{0})=f_{T-1}^{T}(\dots(f_{1}^{2}(f_{0}^{1}(\mathbf{H}_{0})))) \tag{1}$$

$$p(\mathbf{H}_{T}|\mathbf{H}_{0})=\sum_{\mathcal{F}\in\mathds{F}}p(\mathbf{H}_{T}|\mathbf{H}_{0},\mathcal{F})\,p(\mathcal{F}) \tag{2}$$

$$\mathcal{F}(\mathbf{H}_{0})=f_{s_{T-1}}^{s_{T}}(\dots(f_{s_{1}}^{s_{2}}(f_{s_{0}}^{s_{1}}(\mathbf{H}_{0})))) \tag{3}$$

$$p(s_{0:T})=p(s_{T})\prod_{t=1}^{T}p(s_{t-1}|s_{t}) \tag{4}$$

$$p(s_{t-1}|s_{t})=\begin{cases}\pi_{t-1}^{s_{t}} & \text{if } s_{t-1}=t-1,\\ 1-\pi_{t-1}^{s_{t}} & \text{if } s_{t-1}=s_{t},\\ 0 & \text{otherwise}\end{cases} \tag{5}$$

$$p(\mathbf{H}_{t}|s_{t},\mathbf{H}_{0})=\sum_{\mathbf{H}_{t-1},s_{t-1}}\Big[p(\mathbf{H}_{t}|\mathbf{H}_{t-1},s_{t},s_{t-1})\,p(s_{t-1}|s_{t})\times\underbrace{p(\mathbf{H}_{t-1}|s_{t-1},\mathbf{H}_{0})}_{\text{Recurrent term}}\Big] \tag{6}$$

$$\tilde{\mathbf{h}}_{t}^{s_{t}}=\tilde{\pi}_{t-1}^{s_{t}}f_{t-1}^{s_{t}}(\tilde{\mathbf{h}}_{t-1}^{t-1})+(1-\tilde{\pi}_{t-1}^{s_{t}})\,\tilde{\mathbf{h}}_{t-1}^{s_{t}} \tag{7}$$

$$\mathcal{L}_{\text{single}}(\psi,\theta)=\sum_{n=1}^{N}\mathbb{E}_{p(\mathbf{H}_{T}|\mathbf{H}_{0}=\mathbf{X}_{n};\psi)}\Big[\mathcal{L}\left(y_{n},\mathbf{H}_{T},\theta\right)\Big] \tag{8}$$

$$\mathcal{L}(\theta,\psi)=\sum_{n=1}^{N}\sum_{t=1}^{T}\mathbb{E}_{p(\mathbf{H}_{t}|s_{t}=T,\mathbf{X}_{n};\psi)}\Big[\mathcal{L}\big(y_{n},\mathbf{H}_{t},\theta_{t}\big)\Big] \tag{9}$$


Full text

Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France

1 Introduction

Convolutional neural networks (CNNs) form the basis of many state-of-the-art computer vision models. Despite their outstanding performance, the computational cost of inference in these CNN-based models is typically very high. This holds back applications on mobile platforms, such as autonomous vehicles, drones, or phones, where computational resources are limited, concurrent data-streams need to be processed, and low-latency prediction is critical.

To accelerate CNNs we can reduce their complexity before training, e.g. by decreasing the number of filters or network layers. This solution, however, may lead to sub-optimal results given that over-parametrization plays a critical role in the optimization of deep networks [7, 9]. Fortunately, other studies have found a complementary phenomenon: given a trained CNN, a large number of its filters are redundant and do not have a significant impact on the final prediction [26]. Motivated by these two findings, much research has focused on accelerating CNNs using network pruning [11, 19, 22, 35, 38, 41, 47, 56]. Pruning can be applied at multiple levels, e.g. by removing independent filters [35, 41], groups of them [11, 19], or entire layers [56]. Despite the encouraging results of these methods, their ability to provide a wide range of operating points along the trade-off between accuracy and computation is limited. The reason is that these approaches typically require training a separate model for each specific pruning level.

In this paper, we propose Convolutional Neural Mixture Models (CNMMs), which provide a novel perspective on network pruning. A CNMM defines a distribution over a large number of CNNs. The mixture is naturally pruned by removing networks with low probabilities, see Figure 1. Despite the appealing simplicity of this approach, it presents several challenges. First, learning a large ensemble of CNNs may require a prohibitive amount of computation. Second, even if many networks in the mixture are pruned, their independent evaluation during inference is likely to be less efficient than computing the output of a single large model.

In order to ensure tractability, we design a parameter-sharing scheme between different CNNs. This enables us to (i) jointly train all the networks, and (ii) efficiently compute an approximation of the mixture output without independently evaluating all the networks.

Image classification and semantic segmentation experiments show that CNMMs achieve an excellent trade-off between prediction accuracy and computational cost. Unlike most previous network pruning approaches, a single CNMM model achieves a wide range of operating points along this trade-off without any re-training.

2 Related work

Neural network ensembles. Learning ensembles of neural networks is a long-standing research topic. Seminal works explored different strategies to combine the outputs of different networks to obtain more accurate predictions [30, 46, 61]. Recently, the success of deep models has renewed interest in ensemble methods.

For this purpose, many approaches have been explored. For instance, [31, 62] used bagging [3] and boosting [49] to train multiple networks. Other works have considered learning diverse models by employing different parameter initializations [34], or re-training a subset of layers [60]. While these strategies are effective for learning diverse networks, their main limitation is the required training cost. In practice, training a deep model can take multiple days, and therefore large ensembles may have a prohibitive cost. To reduce the training time, it has been suggested [18, 39] to train a single network and to use parameters from multiple iterations of the optimization process to define the ensemble. Although this method is efficient during training, it does not reduce inference cost, since multiple networks must be evaluated independently at test time.

An alternative strategy to allow efficient training and inference is to use implicit ensembles [11, 21, 32, 47]. By relying on sampling, these methods make it possible to jointly train all the individual components in the ensemble and to perform approximate inference during testing. Bayesian neural networks (BNNs) fall in this paradigm and use a distribution over parameters, rather than a single point estimate [11, 28, 40, 47]. A sample from the parameter distribution can be considered as an individual network. Other works have implemented the notion of implicit ensembles by using dropout [52] mechanisms. Dropping neurons can be regarded as sampling over a large ensemble of different networks [2]. Moreover, scaling outputs during testing according to the dropout probability can be understood as an approximate inference mechanism. Motivated by this idea, different works have applied dropout over individual weights [10], network activations [50], or connections in multi-branch architectures [12, 32]. Interestingly, it has been observed that ResNets [13] behave like an ensemble of models, where some residual connections can be removed without significantly reducing prediction accuracy [55]. This idea was used by ResNets with stochastic depth [21], where different dropout probabilities are assigned to the residual connections.

Our proposed Convolutional Neural Mixture Model is an implicit ensemble defining a mixture distribution over an exponential number of CNNs. This makes it possible to use the learned probabilities to prune the model by removing non-relevant networks. Using a mixture of CNNs for model pruning is a novel approach, which contrasts with previous methods employing ensembles for other purposes such as boosting performance [8, 34], improving learning dynamics [21], or uncertainty estimation [24, 31].

Efficient inference in deep networks. A number of strategies have been developed to reduce the inference time of CNNs, including the design of efficient convolutional operators [16, 23, 58], knowledge distillation [4, 15], neural architecture search [14, 63], weight compression [43, 54], and quantization [25, 36]. Network pruning has emerged as one of the most effective frameworks for this purpose [11, 33, 35, 56]. Pruning methods aim to remove weights which do not have a significant impact on the network output. Among these methods, we can differentiate between two main strategies: online and offline pruning. In offline pruning, a network is first optimized for a given task using standard training. Subsequently, non-relevant weights are identified using different heuristics including their norm [35], similarity to other weights [51], or second order derivatives [33]. The main advantage of this strategy is that it can be applied to any pre-trained network. However, these approaches require a costly process involving several prune/retrain cycles in order to recover the original network performance. Online approaches, on the other hand, perform pruning during network training. For example, sparsity inducing regularization can be used over individual weights [38, 40], groups of them [11, 19, 47], or over the connections in multi-branch architectures [1, 56]. These methods typically have a hyper-parameter, to be set before training, determining the trade-off between the final performance and the pruning ratio.

In contrast to previous approaches, we prune entire CNNs by removing the networks with the smallest probabilities in the mixture. This approach offers two main advantages. First, it does not require defining a hyper-parameter before training to determine the balance between the potential compression and the final performance. Second, the number of removed networks can be controlled after optimization. Therefore, a learned CNMM can be deployed at multiple operating points to trade off computation and prediction accuracy, for example across different devices with varying computational resources, or on the same device with different computational constraints depending on the load from other processes. The recently proposed Slimmable Neural Networks [57] have also focused on adapting the accuracy-efficiency trade-off at run time. This is achieved by embedding a small set of CNNs with varying widths into a single model. Different from this approach, our CNMMs embed a large number of networks with different depths, which allows for a finer granularity to control the computational cost during pruning.

3 Convolutional Neural Mixture Models

Without loss of generality, we consider a CNN as a function $\mathcal{F}(\mathbf{H}_{0})=\mathbf{H}_{T}$ mapping an RGB image $\mathbf{H}_{0}\in\mathbb{R}^{W_{0}\times H_{0}\times 3}$ to a tensor $\mathbf{H}_{T}\in\mathbb{R}^{W\times H\times C}$ . In particular, we assume that $\mathcal{F}$ is defined as a sequence of $T$ operations:

$$\mathcal{F}(\mathbf{H}_{0})=f_{T-1}^{T}(\dots(f_{1}^{2}(f_{0}^{1}(\mathbf{H}_{0})))), \tag{1}$$

where $\mathbf{H}_{t}=f_{t-1}^{t}(\mathbf{H}_{t-1})$ is computed from the previous feature map $\mathbf{H}_{t-1}$ . We assume that the functions $f_{t-1}^{t}$ can be either the identity function, or a standard CNN block composed of different operations such as batch-normalization, convolution, activation functions, or spatial pooling. In this manner, the effective depth of the network, i.e. the number of non-identity layers $f_{t-1}^{t}$ , is at most $T$ .

The output tensor $\mathbf{H}_{T}$ of the CNN is used to make predictions for a specific task. For example, in image classification, a linear classifier over $\mathbf{H}_{T}$ can be used in order to estimate the class probabilities for the entire image. For semantic segmentation the same linear classifier is used for each spatial position in $\mathbf{H}_{T}$ .

Given these definitions, a convolutional neural mixture model (CNMM) defines a distribution over output $\mathbf{H}_{T}$ as:

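$$p(\mathbf{H}_{T}|\mathbf{H}_{0})=\sum_{\mathcal{F}\in\mathds{F}}p(\mathbf{H}_{T}|\mathbf{H}_{0},\mathcal{F})\,p(\mathcal{F}), \tag{2}$$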

where $\mathds{F}=\{\mathcal{F}_{1},\mathcal{F}_{2},...,\mathcal{F}_{K}\}$ is a finite set of CNNs, $p(\mathbf{H}_{T}|\mathbf{H}_{0},\mathcal{F}_{k})$ is a delta function centered on the output $\mathcal{F}_{k}(\mathbf{H}_{0})$ of each network, and $p(\mathcal{F})$ defines the mixing weights over the CNNs in $\mathds{F}$ .

3.1 Modelling a distribution over CNNs

We now define mixtures that contain a number of CNNs that is exponential in the maximum depth $T$ , in a way that allows us to manipulate these mixtures in a tractable manner.

Each component in the mixture is a chain-structured CNN uniquely characterised by a sequence ${s}_{0:T}$ of length $T+1$ , where the sequences are constrained to be a non-decreasing set of integers from $0$ to $T$ , i.e. with $s_{0}=0$ , $s_{T}=T$ and $s_{t+1}\geq s_{t}$ . This sequence determines the set of functions that are used in Eq. (1). In particular, given a sequence ${s}_{0:T}$ , the output of the corresponding network is computed as:

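$$\mathcal{F}(\mathbf{H}_{0})=f_{s_{T-1}}^{s_{T}}(\dots(f_{s_{1}}^{s_{2}}(f_{s_{0}}^{s_{1}}(\mathbf{H}_{0})))). \tag{3}$$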

For $i<j$ the function $f_{i}^{j}$ is a convolutional block as described above with its own parameters, while the functions $f_{i}^{i}$ are identity functions that leave the input unchanged.

By imposing $s_{t-1}\in\{t-1,s_{t}\}$ , there is a one-to-one mapping between sequences ${s}_{0:T}$ and the corresponding CNNs, see Figure 2 (Left). In particular, this constraint ensures that, e.g., the network $f_{1}^{4}(f_{0}^{1}(\mathbf{H}_{0}))$ is uniquely encoded by the sequence ‘01444’, ruling out the alternative sequences ‘01144’ and ‘01114’. If multiple networks use the same function $f_{i}^{j}$ , these networks share their parameters on this function, which ensures that the total number of parameters of the mixture does not grow exponentially, although there are exponentially many mixture components. For instance, for $T=4$ , the mixture will be composed of eight different networks illustrated in Figure 2 (Left). From the illustration it is easy to see that, in general, the mixture contains $2^{T-1}$ components with shared parameters.
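The constraint can be checked by brute force for small $T$. The following is a minimal sketch, not the authors' code; the helper names `valid_sequences` and `blocks_used` are hypothetical and introduced only for illustration. It builds the sequences backwards from $s_{T}=T$ and lists the non-identity blocks $f_{i}^{j}$ that each sequence applies.

```python
def valid_sequences(T):
    """Enumerate sequences s_0..s_T with s_0 = 0, s_T = T and s_{t-1} in {t-1, s_t}."""
    partial = [[T]]                       # each entry holds s_t, ..., s_T
    for t in range(T, 1, -1):             # choose s_{t-1} for t = T, ..., 2
        partial = [[prev] + seq
                   for seq in partial
                   for prev in sorted({t - 1, seq[0]})]
    return [tuple([0] + seq) for seq in partial]   # s_0 = 0 is forced


def blocks_used(seq):
    """Non-identity blocks f_i^j, as (i, j) pairs, applied by the network of a sequence."""
    return [(i, j) for i, j in zip(seq[:-1], seq[1:]) if i < j]


if __name__ == "__main__":
    for seq in valid_sequences(4):
        print("".join(map(str, seq)), blocks_used(seq))
```

For $T=4$ the loop lists eight sequences, including ‘01444’ with blocks $f_{0}^{1}$ and $f_{1}^{4}$, and each sequence maps to a distinct chain of blocks.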

In order to define the probabilities $p(\mathcal{F})$ for each network in the mixture, we define a distribution over sequences ${s}_{0:T}$ as a reversed Markov chain:

$$p(s_{0:T})=p(s_{T})\prod_{t=1}^{T}p(s_{t-1}|s_{t}). \tag{4}$$

To ensure that sequences have positive probability if and only if they are valid, i.e. satisfy the constraints defined above, we set $p(s_{T}\!=\!T)=p(s_{0}\!=\!0|s_{1})=1$ and define:

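$$p(s_{t-1}|s_{t})=\begin{cases}\pi_{t-1}^{s_{t}} & \text{if } s_{t-1}=t-1,\\ 1-\pi_{t-1}^{s_{t}} & \text{if } s_{t-1}=s_{t},\\ 0 & \text{otherwise.}\end{cases} \tag{5}$$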

As illustrated in Figure 2 (Right), these constraints define a binary tree whose paths are the valid non-decreasing sequences. The conditional probabilities $p(s_{t-1}|s_{t})$ are modelled by a Bernoulli distribution with parameter $\pi_{t-1}^{s_{t}}$ , the probability that the previous number in the sequence is $t-1$ rather than $s_{t}$ .
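As a sanity check on Eqs. (4) and (5), the sketch below evaluates $p(s_{0:T})$ for an arbitrary integer sequence and confirms that only valid sequences receive mass. It is an illustration under stated assumptions: the function name `sequence_probability` and the dictionary layout `pi[(t-1, s_t)]` are hypothetical, and the value 0.3 used for every $\pi_{t-1}^{s_{t}}$ is arbitrary.

```python
import itertools


def sequence_probability(seq, pi):
    """p(s_{0:T}) under the reversed Markov chain of Eqs. (4)-(5).

    pi[(t - 1, s)] holds p(s_{t-1} = t-1 | s_t = s) for t = 2..T and s = t..T.
    """
    T = len(seq) - 1
    if seq[0] != 0 or seq[T] != T:
        return 0.0                        # p(s_T = T) = p(s_0 = 0 | s_1) = 1
    prob = 1.0
    for t in range(T, 1, -1):             # the factor for t = 1 is always one
        prev, cur = seq[t - 1], seq[t]
        if prev == t - 1:
            prob *= pi[(t - 1, cur)]
        elif prev == cur:
            prob *= 1.0 - pi[(t - 1, cur)]
        else:
            return 0.0                    # invalid transition
    return prob


if __name__ == "__main__":
    T = 4
    pi = {(t - 1, s): 0.3 for t in range(2, T + 1) for s in range(t, T + 1)}
    all_seqs = itertools.product(range(T + 1), repeat=T + 1)
    probs = [sequence_probability(s, pi) for s in all_seqs]
    print(sum(probs), sum(p > 0 for p in probs))
```

The probabilities sum to one over all integer sequences, and exactly $2^{T-1}$ of them are non-zero.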

3.2 Sampling outputs from CNMMs

The graphical model defined in Figure 3 shows that we can sample from the output distribution $p(\mathbf{H}_{T}|\mathbf{H}_{0})$ in Eq. (2) by first generating a sequence from $p({s}_{0:T})$ and then evaluating the associated network with Eq. (3). In the following, we formulate an alternative strategy to sample from the model. This formulation offers two advantages. (i) It is amenable to continuous relaxation, which facilitates learning. (ii) It suggests an iterative algorithm to compute feature map expectations, which can be used instead of sampling for efficient inference.

The conditional $p(\mathbf{H}_{t}|s_{t}=l,\mathbf{H}_{0})$ gives the distribution over $\mathbf{H}_{t}$ across the networks with $s_{t}=l$ . For example, $p(\mathbf{H}_{2}|s_{2}=4,\mathbf{H}_{0})$ consists of two weighted delta peaks, located at $f_{0}^{4}(\mathbf{H}_{0})$ and $f_{1}^{4}(f_{0}^{1}(\mathbf{H}_{0}))$ , respectively. See Figure 2 (Left). These conditional distributions can be expressed as the forward recurrence:

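$$p(\mathbf{H}_{t}|s_{t},\mathbf{H}_{0})=\sum_{\mathbf{H}_{t-1},s_{t-1}}\Big[p(\mathbf{H}_{t}|\mathbf{H}_{t-1},s_{t},s_{t-1})\,p(s_{t-1}|s_{t})\times\underbrace{p(\mathbf{H}_{t-1}|s_{t-1},\mathbf{H}_{0})}_{\text{Recurrent term}}\Big], \tag{6}$$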

where $p(\mathbf{H}_{t}|\mathbf{H}_{t-1},{s}_{t},{s}_{t-1})$ is a delta function centered on $f_{s_{t-1}}^{s_{t}}(\mathbf{H}_{t-1})$ . Therefore, unbiased samples $\tilde{\mathbf{h}}^{s_{t}}_{t}$ from $p(\mathbf{H}_{t}|s_{t},\mathbf{H}_{0})$ can be obtained through sample propagation. Recall from Eq. (5) that, given $s_{t}$ , there are only two possible values of $s_{t-1}$ that remain, namely $s_{t}$ and $t-1$ . As a consequence, the sum over $s_{t-1}$ in Eq. (6) only consists of two terms. Given this observation, samples $\tilde{\mathbf{h}}^{s_{t}}_{t}\sim p(\mathbf{H}_{t}|s_{t},\mathbf{H}_{0})$ can be obtained from samples $\tilde{\mathbf{h}}^{s_{t-1}}_{t-1}\sim p(\mathbf{H}_{t-1}|s_{t-1},\mathbf{H}_{0})$ as:

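$$\tilde{\mathbf{h}}_{t}^{s_{t}}=\tilde{\pi}_{t-1}^{s_{t}}f_{t-1}^{s_{t}}(\tilde{\mathbf{h}}_{t-1}^{t-1})+(1-\tilde{\pi}_{t-1}^{s_{t}})\,\tilde{\mathbf{h}}_{t-1}^{s_{t}}, \tag{7}$$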

where for a given value of $s_{t}$ we sample $s_{t-1}$ from $p(s_{t-1}|s_{t})$ to compute a binary indicator $\tilde{\pi}_{t-1}^{s_{t}}=[\![s_{t-1}=t-1]\!]$ , which signals whether the resulting $\tilde{\mathbf{h}}^{s_{t}}_{t}$ is equal to $\tilde{\mathbf{h}}^{s_{t}}_{t-1}$ or $f_{t-1}^{s_{t}}(\tilde{\mathbf{h}}^{t-1}_{t-1})$ .

Using Eq. (7) we iteratively sample from distributions $p(\mathbf{H}_{t}|s_{t},\mathbf{H}_{0})$ for $t=1,\dots,T$ , and for each $t$ we compute samples for $s_{t}=t,\dots,T$ . An illustration of the algorithm is shown in Figure 4. The computational complexity of a complete pass in this iterative process is $O(T(T+1)/2)$ , since for each $t=1,\dots,T$ , we compute $T-t+1$ samples, each of which is computed in $O(1)$ from the samples already computed for $t-1$ . This is roughly equivalent to the cost of evaluating a single network with dense layer connectivity of depth $T$ [20], which has a total of $T(T-1)/2$ connections implemented by the functions $f_{i}^{j}$ .
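The iterative pass over Eq. (7) can be sketched as follows. This is a toy illustration rather than the authors' implementation: `sample_output`, `blocks` and `pi` are hypothetical names, hard (non-relaxed) Bernoulli draws are used, and each toy block simply appends its index pair to a tuple so that the returned value records which network was sampled.

```python
import random


def sample_output(h0, blocks, pi, T, rng):
    """Draw one sample of H_T by sample propagation (Eq. 7).

    blocks[(i, j)] is a callable standing in for f_i^j (i < j);
    pi[(t - 1, s)] = p(s_{t-1} = t-1 | s_t = s) for t = 2..T, s = t..T.
    Returns the sample and the number of per-(t, s) updates performed.
    """
    # t = 1: s_0 = 0 with probability one, so h_1^s = f_0^s(H_0) for every s.
    h = {s: blocks[(0, s)](h0) for s in range(1, T + 1)}
    updates = T
    for t in range(2, T + 1):
        new_h = {}
        for s in range(t, T + 1):
            use_block = rng.random() < pi[(t - 1, s)]    # binary indicator
            new_h[s] = blocks[(t - 1, s)](h[t - 1]) if use_block else h[s]
            updates += 1
        h = new_h
    return h[T], updates


if __name__ == "__main__":
    T = 4
    # Toy blocks: record the path of applied blocks instead of computing features.
    blocks = {(i, j): (lambda path, i=i, j=j: path + ((i, j),))
              for i in range(T) for j in range(i + 1, T + 1)}
    pi = {(t - 1, s): 0.5 for t in range(2, T + 1) for s in range(t, T + 1)}
    print(sample_output((), blocks, pi, T, random.Random(0)))
```

The update counter equals $T(T+1)/2$ for any draw, matching the complexity stated above. A block is only evaluated when its indicator is one, which is what the pruning mechanism of Section 3.4 exploits.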

Sampling outputs from networks of bounded depth. Using the described algorithm, samples $\tilde{\mathbf{h}}^{T}_{T}\sim p(\mathbf{H}_{T}|s_{T}=T,\mathbf{H}_{0})$ correspond to output tensors $\mathbf{H}_{T}$ sampled from the mixture defined in Eq. (2). Moreover, for any $t$ , samples from $p(\mathbf{H}_{t}|s_{t}=T,\mathbf{H}_{0})$ are output feature maps generated by networks with depth bounded by $t$ . For instance, in Figure 2, samples $\tilde{\mathbf{h}}^{T}_{2}$ are generated with one of the networks coded by the sequences $01444$ and $04444$ .

3.3 Training and inference

We use $\psi$ to collectively denote the parameters of the convolutional blocks $f_{i}^{j}$ and the parameters $\pi_{t-1}^{s_{t}}$ defining the mixing weights via Eq. (5). Moreover, the parameters of the classifier that predicts the image label(s) from the output tensor $\mathbf{H}_{T}$ are denoted as $\theta$ . Given a training set $\mathcal{D}=\{(\mathbf{X}_{1},y_{1}),\dots,(\mathbf{X}_{N},y_{N})\}$ composed of images $\mathbf{X}_{n}$ and labels $y_{n}$ , we optimize the parameters by minimizing

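$$\mathcal{L}_{\text{single}}(\psi,\theta)=\sum_{n=1}^{N}\mathbb{E}_{p(\mathbf{H}_{T}|\mathbf{H}_{0}=\mathbf{X}_{n};\psi)}\Big[\mathcal{L}\left(y_{n},\mathbf{H}_{T},\theta\right)\Big], \tag{8}$$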

where $\mathcal{L}\left(y_{n},{\mathbf{H}}_{T},\theta\right)$ is the cross-entropy loss comparing the label $y_{n}$ with the class probabilities computed from $\mathbf{H}_{T}$ . In practice, we replace the expectation over $\mathbf{H}_{T}$ in each training iteration with samples from $p({\mathbf{H}}_{T}|\mathbf{H}_{0}=\mathbf{X_{n}};\psi)$ .

Learning from subsets of networks. As discussed in Section 3.2, samples from the distribution $p(\mathbf{H}_{t}|s_{t}=T,\mathbf{H}_{0})$ correspond to outputs of CNNs in the mixture with depth at most $t$ . In order to improve performance of models with reduced inference time, we explicitly emphasize the loss for such efficient relatively shallow networks. Therefore, we sum the above loss function over the outputs sampled from networks of increasing depth:

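$$\mathcal{L}(\theta,\psi)=\sum_{n=1}^{N}\sum_{t=1}^{T}\mathbb{E}_{p(\mathbf{H}_{t}|s_{t}=T,\mathbf{X}_{n};\psi)}\Big[\mathcal{L}\big(y_{n},\mathbf{H}_{t},\theta_{t}\big)\Big], \tag{9}$$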

where we use a separate classifier for each $t$ . In practice, we balance each loss with a weight increasing linearly with $t$ .

Relaxed binary variables with concrete distributions. The recurrence in Eq. (7) requires sampling from $p(s_{t-1}|s_{t})$ , defined in Eq. (5). The sampling renders the parameters $\pi_{t-1}^{s_{t}}$ non-differentiable, which prevents gradient-based optimization for them. To address this limitation, we use a continuous relaxation by modelling $p(s_{t-1}|s_{t})$ as a binary “concrete” distribution [42]. In this manner, we can use the re-parametrization trick [27, 48] to back-propagate gradients w.r.t. samples $\tilde{\pi}_{t-1}^{s_{t}}$ in Eq. (7) and, thus to compute gradients for the parameters $\pi_{t-1}^{s_{t}}$ .
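For reference, a relaxed Bernoulli sample can be drawn as sketched below. This follows the standard binary concrete construction (logistic noise added to the logit, divided by a temperature, then passed through a sigmoid); the text does not spell out the exact parametrization used, so treat the function `sample_binary_concrete` and its arguments as assumptions made for illustration.

```python
import math
import random


def sample_binary_concrete(pi, temperature, rng):
    """Relaxed sample in (0, 1) whose hard threshold at 0.5 is Bernoulli(pi)."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)       # avoid log(0)
    logistic_noise = math.log(u) - math.log(1.0 - u)
    logit = math.log(pi) - math.log(1.0 - pi)
    return 1.0 / (1.0 + math.exp(-(logit + logistic_noise) / temperature))


if __name__ == "__main__":
    rng = random.Random(0)
    # Temperature 2 is the value reported in the optimization details.
    samples = [sample_binary_concrete(0.3, 2.0, rng) for _ in range(20000)]
    print(sum(x > 0.5 for x in samples) / len(samples))   # close to 0.3
```

Because the sample is a smooth function of `pi` for fixed noise `u`, gradients can flow to the mixing parameters. Lower temperatures push samples towards 0 or 1 at the price of higher gradient variance.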

Efficient inference by expectation propagation. Once the CNMM is trained, the predictive distribution on $y$ is given by $p(y|\mathbf{X};\theta)=\mathbb{E}_{p(\mathbf{H}_{T}|\mathbf{X})}[p(y|\mathbf{H}_{T};\theta)]$ . This expectation is intractable to compute exactly, which runs contrary to our goal of efficient inference. A naive Monte-Carlo approximation still requires multiple evaluations of the full CNMM. Instead, we propose an alternative approximation that propagates expectations instead of samples in Eq. (7), i.e. we summarize $p(\mathbf{H}_{T}|\mathbf{X})$ by a single tensor $\bar{\mathbf{H}}_{T}$ , where $\bar{\mathbf{H}}_{T}$ is obtained by running the sampling algorithm with the samples $\tilde{\pi}_{t-1}^{s_{t}}$ replaced by their expectations ${\pi}_{t-1}^{s_{t}}$ .
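A scalar toy version of this procedure is sketched below, together with a brute-force mixture mean for comparison. The names `propagate_expectation` and `exact_mixture_mean` are hypothetical, and the affine toy blocks are an assumption chosen because, for affine $f_{i}^{j}$, propagating expectations reproduces the exact mean of the mixture output; with non-linear blocks the propagated value is in general only an approximation.

```python
import itertools


def propagate_expectation(h0, blocks, pi, T):
    """Run the recurrence of Eq. (7) with the indicators replaced by pi."""
    h = {s: blocks[(0, s)](h0) for s in range(1, T + 1)}
    for t in range(2, T + 1):
        h = {s: pi[(t - 1, s)] * blocks[(t - 1, s)](h[t - 1])
                + (1.0 - pi[(t - 1, s)]) * h[s]
             for s in range(t, T + 1)}
    return h[T]


def exact_mixture_mean(h0, blocks, pi, T):
    """Brute-force mean over all 2^(T-1) networks (only feasible for small T)."""
    total = 0.0
    for choices in itertools.product([True, False], repeat=T - 1):
        seq, weight = [T], 1.0
        for t, pick_block in zip(range(T, 1, -1), choices):
            s_t = seq[0]
            p = pi[(t - 1, s_t)]
            weight *= p if pick_block else 1.0 - p
            seq.insert(0, t - 1 if pick_block else s_t)
        seq.insert(0, 0)                                  # s_0 = 0
        h = h0
        for i, j in zip(seq[:-1], seq[1:]):
            if i < j:                                     # f_i^i is the identity
                h = blocks[(i, j)](h)
        total += weight * h
    return total


if __name__ == "__main__":
    T = 5
    blocks = {(i, j): (lambda x, a=0.5 + 0.1 * i, b=float(j): a * x + b)
              for i in range(T) for j in range(i + 1, T + 1)}
    pi = {(t - 1, s): 0.2 + 0.1 * t for t in range(2, T + 1) for s in range(t, T + 1)}
    print(propagate_expectation(1.0, blocks, pi, T),
          exact_mixture_mean(1.0, blocks, pi, T))
```

The propagated pass touches each block at most once, whereas the brute-force mean evaluates every network separately.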

3.4 Accelerating CNMMs

CNMMs offer two complementary mechanisms in order to accelerate inference. We describe both in the following.

Evaluating intermediate classifiers. The different classifiers $\theta_{t}$ learned by minimizing Eq. (9) operate over the outputs of a mixture of networks with maximum depth $t$ . Therefore, at each iteration $t$ of the inference algorithm in Eq. (7) we can already output predictions based on classifier $\theta_{t}$ . This strategy is related to the one employed in multi-scale dense networks (MSDNets) [17], where “early-exit” classifiers are used to provide predictions at various points in time during the inference process.

Network pruning. A complementary strategy to accelerate CNMMs is to remove networks from the mixture. The computational cost of the inference process is dominated by the evaluation of the CNN blocks $f_{t-1}^{s_{t}}(\tilde{\mathbf{h}}^{t-1}_{t-1})$ in Eq. (7). However, such a function does not need to be computed when the variable $\tilde{\pi}_{t-1}^{s_{t}}=0$ . Therefore, a natural approach to prune CNMMs is to set certain ${\pi}_{t-1}^{s_{t}}$ to zero, removing all the CNNs from the mixture that use $f^{s_{t}}_{t-1}$ . We use the learned distribution $p({s}_{0:T})$ in order to remove networks with a low probability. Note that for a given value of $s_{t}$ , the pairwise marginal $p(s_{t},s_{t-1}\!=\!t-1)$ is exactly the sum of probabilities of all the networks involving the function $f^{s_{t}}_{{t-1}}$ . Using this observation, we use an iterative pruning algorithm where, at each step, we compute all pairwise marginals $p(s_{t},s_{t-1}\!=\!t-1)$ for all possible values of $s_{t}$ and $t$ . We then set ${\pi}_{t^{\star}-1}^{s_{t}^{\star}}=0$ where $(s_{t}^{\star},t^{\star})=\arg\min_{(s_{t},t)}p(s_{t},s_{t-1}\!=\!t-1)$ . Finally, the marginals are updated, and we iterate.

In this manner, we achieve different pruning levels by progressively removing convolutional blocks that will not be evaluated during inference. This process does not require any re-training of the model, so different pruning ratios can be set dynamically. Note that this process is complementary to the use of intermediate classifiers, as discussed above. The reason for this is that our pruning strategy may be used to remove functions $f_{i}^{j}$ for any “early” prediction step $t<T$ . Finally, it is interesting to observe that the proposed pruning mechanism can be regarded as a form of neural architecture search [37, 63], where the optimal network connectivity for a given pruning ratio is automatically discovered by taking into account the learned probabilities $p(s_{t-1}\!=\!t-1|s_{t})$ .
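The iterative pruning loop can be sketched in a few lines. The code below is an illustration under explicit assumptions, not the authors' implementation: `pairwise_marginals` and `prune` are hypothetical names; only blocks $f_{t-1}^{s_{t}}$ with $t\geq 2$ are candidates, since $p(s_{0}=0|s_{1})=1$ is fixed; and entries that are already zero are skipped so that each step removes a new block.

```python
def pairwise_marginals(pi, T):
    """Return {(t, s): p(s_t = s, s_{t-1} = t-1)} for t = 2..T and s = t..T."""
    marginal = {T: 1.0}                       # p(s_T = T) = 1
    pairwise = {}
    for t in range(T, 1, -1):
        previous = {t - 1: 0.0}               # will hold p(s_{t-1} = .)
        for s, p_s in marginal.items():
            joint = p_s * pi[(t - 1, s)]
            pairwise[(t, s)] = joint
            previous[t - 1] += joint
            previous[s] = p_s - joint         # p(s_t = s) * (1 - pi)
        marginal = previous
    return pairwise


def prune(pi, T, num_blocks):
    """Greedily zero the num_blocks entries of pi with the smallest pairwise marginal."""
    pi = dict(pi)                             # leave the caller's parameters untouched
    removed = []
    for _ in range(num_blocks):
        pairwise = pairwise_marginals(pi, T)  # marginals are updated at every step
        candidates = {k: v for k, v in pairwise.items() if pi[(k[0] - 1, k[1])] > 0.0}
        if not candidates:
            break
        t_star, s_star = min(candidates, key=candidates.get)
        pi[(t_star - 1, s_star)] = 0.0
        removed.append((t_star - 1, s_star))  # block f_{t*-1}^{s*} is never evaluated again
    return pi, removed


if __name__ == "__main__":
    pi = {(1, 2): 0.9, (1, 3): 0.2, (2, 3): 0.5}
    print(prune(pi, T=3, num_blocks=2))
```

Setting an entry to zero forces $s_{t-1}=s_{t}$ at that node, so the remaining probabilities still define a valid distribution over the surviving networks without any renormalization step.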

4 Experiments

We perform experiments over two different tasks: image classification and semantic segmentation. Following previous work, we measure the computational cost in terms of the number of floating point multiply and addition operations (FLOPs) required for inference. The number of FLOPs provides a metric that correlates very well with the actual inference wall-time, while being independent of implementation and hardware used for evaluation.

4.1 Datasets and experimental setup

CIFAR-10/100 datasets. These datasets [29] are composed of 50k train and 10k test images with a resolution of 32 $\times$ 32 pixels. The goal is to classify each image across 10 or 100 classes, respectively. Images are normalized using the means and standard deviations of the RGB channels. We apply standard data augmentation operations: (i) $4$ -pixel zero padding followed by 32 $\times$ 32 cropping, and (ii) random horizontal flipping with probability $0.5$ . Performance is evaluated in terms of the mean accuracy across classes.

CityScapes dataset. This dataset [6] contains 1024 $\times$ 2048 pixel images of urban scenes with pixel-level labels across 19 classes. The dataset is split into training, validation and test sets with 2,975, 500 and 1,525 samples each. The ground-truth annotations for the test set are not public, and we use the validation set instead for evaluation. To assess performance we use the standard mean intersection-over-union (mIoU) metric. We follow the setup of [45], and down-sample the images by a factor two before processing them. As a data augmentation strategy during training, we apply random horizontal flipping and resizing by using a scaling factor between 0.75 and 1.1. Finally, we use random crops of 384 $\times$ 768 pixels from the down-sampled images.
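For readers unfamiliar with the metric, mIoU can be computed from a class confusion matrix as in the sketch below. The function name `mean_iou` is hypothetical, and skipping classes that appear in neither the prediction nor the ground truth is a common convention assumed here, not something stated in the text.

```python
def mean_iou(confusion):
    """Mean intersection-over-union from a square confusion matrix.

    confusion[i][j] = number of pixels with ground-truth class i predicted as class j.
    """
    num_classes = len(confusion)
    ious = []
    for c in range(num_classes):
        true_pos = confusion[c][c]
        false_neg = sum(confusion[c]) - true_pos
        false_pos = sum(confusion[r][c] for r in range(num_classes)) - true_pos
        union = true_pos + false_pos + false_neg
        if union > 0:                          # assumption: ignore absent classes
            ious.append(true_pos / union)
    return sum(ious) / len(ious)


if __name__ == "__main__":
    print(mean_iou([[3, 1], [1, 3]]))          # each class has IoU 3 / 5
```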

Base architecture. As discussed in Section 3.2, the learning and inference algorithms for CNMMs can be implemented using a network with dense layer connectivity [20]. Based on this observation, we use an architecture similar to MSDNets [17]. Specifically, we define a set of $B$ blocks, each composed of a set of $S$ feature maps $\mathbf{H}_{t}$ , see Figure 5.

The initial feature map in each block has $C$ channels and, at each subsequent feature map in the block, the spatial resolution is reduced by a factor two in each dimension, and the number of channels is doubled. Feature maps are connected by functions $f_{i}^{j}$ if the output feature map $\mathbf{H}_{j}$ has the same or half the resolution of the input feature map $\mathbf{H}_{i}$ . Finally, we consider the output tensor $\mathbf{H}_{T}$ to have different connectivity and spatial resolution depending on the task.

Implementation for image classification. We implement the convolutional layers as the set of operations (BN-ReLU-DConv-BN-ReLU-Conv-BN), where BN refers to batch normalization, DConv is a $(3\times 3\times C\times\frac{C}{4})$ depth-wise separable convolution [16], and Conv is a $(1\times 1\times\frac{C}{4}\times C)$ convolution. In order to reduce computation, for a given tensor $\mathbf{H}_{i}$ , the different functions $f_{i}^{j}$ share the parameters of the initial operations (BN-ReLU-DConv) for all $j$ . Moreover, when the resolution of the feature map is reduced, we use average pooling after these three initial operations. In all our experiments, the number of initial channels in $\mathbf{H}_{1}$ is set to $C=64$ . This is achieved by using a $(3\times 3\times 3\times\frac{C}{4})$ convolution over the input image $\mathbf{H}_{0}$ and then applying a $(1\times 1\times\frac{C}{4}\times C)$ convolutional block. Finally, all the tensors $\mathbf{H}_{l}$ with the lowest spatial resolution are connected to the output $\mathbf{H}_{T}$ . Concretely, $\mathbf{H}_{T}$ is a vector in $\mathbb{R}^{512}$ obtained by applying the operations (BN-ReLU-GP-FC-BN) to the input tensors, where GP refers to global average pooling, and FC corresponds to a fully-connected layer. The classifier $\theta$ maps $\mathbf{H}_{T}$ linearly to a vector of dimension equal to the number of classes. When using Eq. (9) to train the CNMM, we connect a classifier $\theta_{t}$ to the end of each block.

Implementation for semantic segmentation. We use the same setup as for image classification, but replace the ReLU activations with parametric ReLUs as in [44]. Moreover, we use max instead of average pooling to reduce the spatial resolution. The input tensor $\mathbf{H}_{1}$ has $C=48$ channels and a resolution four times lower than the original images. This is achieved by applying a $(3\times 3\times 3\times\frac{C}{4})$ convolution with stride $2$ to the input and then using a (BN-ReLU-Conv) block followed by max pooling. The output tensor $\mathbf{H}_{T}$ receives connections from all the previous feature maps and has the same number of channels and spatial resolution as $\mathbf{H}_{1}$ . Given that the input feature maps are at different scales, we apply a (BN-PReLU-Conv-BN) block over each input tensor and use bi-linear up-sampling with different scaling factors in order to bring them to a common spatial resolution. The final classifier $\theta$ computing the class probabilities from $\mathbf{H}_{T}$ is defined as a block (UP-BN-PReLU-Conv-UP-BN-PReLU-DConv), where UP refers to bilinear upsampling, which recovers the original image resolution. The first and second convolutions in the block have $12$ and $K$ output channels respectively, where $K$ is the number of classes. As in image classification, we use an intermediate classifier $\theta_{t}$ at each step $t$ where a full block of computation is finished.

Optimization details. In our experiments, we use SGD with momentum, setting the initial learning rate to 0.1 and the weight decay to $10^{-4}$. On CIFAR, we use a cosine annealing schedule to accelerate convergence. On Cityscapes, in contrast, we employ a cyclic schedule with warm restarts as in [45]. The temperature of the concrete distributions modelling $p(s_{t-1}|s_{t})$ is set to 2. We train our models for 300 and 200 epochs, with batch sizes of 64 and 6, for CIFAR-10/100 and Cityscapes respectively. For the cyclic scheduler, the learning rate is divided by two at epochs $\{30,60,90,120,150,170,190\}$. Additionally, the models trained on Cityscapes are fine-tuned for 10 epochs using random crops of size 512 $\times$ 1024 instead of 384 $\times$ 768.
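The two learning-rate rules mentioned above can be sketched in plain Python. This is a minimal sketch under stated assumptions: the cosine rule is the standard annealing formula decaying to zero, and the halving rule applies from the listed epoch onwards. The shape of the learning rate within each cycle of the Cityscapes schedule is not specified in the text and is therefore not modelled. The function names are hypothetical.

```python
import math

BASE_LR = 0.1                                     # initial learning rate from the text
HALVING_EPOCHS = (30, 60, 90, 120, 150, 170, 190)  # epochs listed for the cyclic scheduler


def cosine_lr(epoch: int, total_epochs: int, base_lr: float = BASE_LR) -> float:
    """Standard cosine annealing from base_lr down to 0 (assumed minimum)."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))


def halved_lr(epoch: int, base_lr: float = BASE_LR,
              milestones: tuple = HALVING_EPOCHS) -> float:
    """Divide the learning rate by two at every milestone already reached.

    Assumption: the halving takes effect at the milestone epoch itself.
    """
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr / (2 ** passed)


if __name__ == "__main__":
    for epoch in (0, 75, 150, 225, 299):
        print(f"CIFAR epoch {epoch:3d}: lr = {cosine_lr(epoch, 300):.5f}")
    for epoch in (0, 30, 100, 199):
        print(f"Cityscapes epoch {epoch:3d}: peak lr = {halved_lr(epoch):.5f}")
```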

4.2 Pruning and intermediate classifiers

We evaluate the proposed pruning and intermediate-classifier strategies for reducing the inference time of trained CNMMs. For CIFAR-10/100 we learn a CNMM with $B=6$ blocks, using $S=3$ scales each. For Cityscapes we use $B=5$ blocks and $S=5$ scales. For each dataset, we train one model that uses a single classifier $\theta$, optimized using Eq. (8). In addition, we train a second model with intermediate classifiers $\theta_{t}$, minimizing the loss function in Eq. (9). In the following, we refer to the first and second variants as CNMM-single and CNMM respectively.

In Figure 6 we report prediction accuracy vs. FLOPs for inference. Each model is represented as a curve, traced by pruning the model to various degrees. Across the three datasets, the CNMM model with intermediate classifiers achieves higher accuracy in fast inference settings than the CNMM-single model. Recall that all the operating points across the different CNMM curves are obtained from a single trained model. Therefore, this single model can realize the upper envelope of the performance curves. As expected, the maximum performance of the intermediate classifiers increases with the step number. The accuracy of CNMM at the final step is comparable to the level obtained by the CNMM-single model: slightly worse on CIFAR-10, and slightly better on CIFAR-100 and Cityscapes. This is because the minimized intermediate losses provide an additional supervisory signal, which is particularly useful to encourage accurate prediction for shallow, but fast, CNNs. In conclusion, the CNMM model with intermediate classifiers is to be preferred, since it provides a better trade-off between accuracy and computation over a wider range of FLOP counts.

By analysing the operating points along each curve, we can observe the effectiveness of the proposed pruning algorithm. For the CIFAR datasets we can reduce the FLOP count by a factor of two without a significant loss in accuracy. For Cityscapes, about 25% pruning can be achieved without a significant loss. In general, if several exit points can reach the same FLOP count by applying varying amounts of pruning, the best performance is obtained by pruning less for an earlier classifier, rather than pruning more for a later exit.
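The selection of operating points from a single trained model can be sketched as follows. This is an illustrative sketch only: the `OperatingPoint` record, the two helper functions, and all numbers in the example are hypothetical and are not taken from Figure 6. Each point stands for one (exit step, pruning ratio) configuration with its measured cost and accuracy.

```python
from typing import List, NamedTuple, Optional


class OperatingPoint(NamedTuple):
    exit_step: int      # intermediate classifier used for prediction
    prune_ratio: float  # fraction of the mixture that was pruned
    mflops: float       # inference cost in millions of FLOPs
    accuracy: float     # validation accuracy in percent


def upper_envelope(points: List[OperatingPoint]) -> List[OperatingPoint]:
    """Keep the points that are not dominated by a cheaper-or-equal, better one."""
    envelope = []
    best = float("-inf")
    # Sort by cost; among equal costs, visit the most accurate point first.
    for p in sorted(points, key=lambda q: (q.mflops, -q.accuracy)):
        if p.accuracy > best:
            envelope.append(p)
            best = p.accuracy
    return envelope


def best_under_budget(points: List[OperatingPoint],
                      budget_mflops: float) -> Optional[OperatingPoint]:
    """Most accurate configuration whose cost fits the budget, if any."""
    feasible = [p for p in points if p.mflops <= budget_mflops]
    return max(feasible, key=lambda q: q.accuracy) if feasible else None


if __name__ == "__main__":
    # Made-up measurements for illustration.
    points = [
        OperatingPoint(3, 0.0, 40.0, 68.0),
        OperatingPoint(3, 0.3, 30.0, 66.5),
        OperatingPoint(4, 0.5, 40.0, 67.0),  # later exit, pruned more, same cost
        OperatingPoint(4, 0.0, 60.0, 71.0),
        OperatingPoint(6, 0.5, 60.0, 70.0),
        OperatingPoint(6, 0.0, 90.0, 74.0),
    ]
    for p in upper_envelope(points):
        print(p)
    print("budget 50 MFLOPs ->", best_under_budget(points, 50.0))
```

At deployment time, a lookup of this kind would let the available budget choose the exit step and pruning ratio without any re-training.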

4.3 Comparison with the state of the art

Image classification. We compare our model with different state-of-the-art CNN acceleration strategies [17, 19, 22, 38, 56]. We consider methods applying pruning at different levels, such as independent filters (Network slimming [38]), groups of weights (CondenseNet [19]), connections in multi-branch architectures (SuperNet [56]), or a combination of them (SSS [22]). We also compare our method with any-time models based on early-exit classifiers (MSDNet [17]). Among previous state-of-the-art methods, the compared approaches have shown the best performance for efficient inference with $\leq$ 200 million FLOPs. We compare to CNMMs using 6 and 12 blocks, using three scales in both cases.

The results in Figure 7 (left, center) show that, on the CIFAR datasets, CNMMs achieve a similar or better accuracy-compute trade-off than all the compared methods across a broad range of FLOP counts. Only CondenseNets show somewhat better performance for medium FLOP counts. Moreover, note that the different operating points shown for the compared methods (except for MSDNets) are obtained by using different models trained independently, e.g. with different settings of a hyper-parameter controlling the pruning ratio. In contrast, CNMM embeds a large number of operating points in a single model. This feature is useful when the available computational budget can change dynamically, based on concurrent processes, or when the model is deployed across a wide range of devices. In these scenarios, a single CNMM can be accelerated on-the-fly depending on the available resources. Note that a single MSDNet is also able to provide early predictions by using intermediate classifiers. However, our CNMM provides better performance for a given FLOP count and allows finer-grained control of the computational cost.

Semantic segmentation. State-of-the-art methods for real-time semantic segmentation have mainly focused on the manual design of efficient network architectures. By employing highly optimized convolutional modules, ESPNet [44] and ESPNetv2 [45] have achieved impressive accuracy-computation trade-offs. Other methods, such as [5, 59], offer higher accuracy but at an inference cost several orders of magnitude higher, limiting their application in resource-constrained scenarios.

In Figure 7 (right) we compare our CNMM results to these two approaches. Note that the original results reported in [45] are obtained by using a model pre-trained on ImageNet. For a fair comparison with our CNMMs, we have trained ESPNetv2 from scratch using the code provided by the authors (https://github.com/sacmehta/EdgeNets). As can be observed, CNMM provides a better trade-off compared to ESPNet. In particular, a full CNMM without pruning obtains an improvement of 0.5 points of mIoU, while reducing the FLOP count by 45%. Moreover, an accelerated CNMM achieves performance similar to that of the most efficient ESPNet, which needs more than twice as many FLOPs. On the other hand, ESPNetv2 gives slightly better trade-offs compared to our CNMMs. However, this model relies on an efficient inception-like module [53] that also includes group point-wise and dilated convolutions. These are orthogonal design choices that can be integrated in our model as well, and we expect them to further improve our results. Additionally, the different operating points in ESPNet and ESPNetv2 are achieved using different models trained independently. Therefore, unlike our approach, these methods do not allow fine-grained control over the accuracy-computation trade-off, and multiple models need to be trained. Figure 8 shows qualitative results using different operating points from a single CNMM.

5 Conclusions

We proposed to address model pruning by using Convolutional Neural Mixture Models (CNMMs), a novel probabilistic framework that embeds a mixture of an exponential number of CNNs. In order to make training and inference tractable, we rely on massive parameter sharing across the models, and use concrete distributions to differentiate through the discrete sampling of mixture components. To achieve efficient inference in CNMMs we use an early-exit mechanism that allows prediction after evaluating only a subset of the networks. In addition, we use a pruning algorithm to remove CNNs that have low mixing probabilities. Our experiments on image classification and semantic segmentation tasks show that CNMMs achieve excellent trade-offs between prediction accuracy and computational cost. Unlike most previous work, a single CNMM model offers a large number and wide range of accuracy-compute trade-offs, without any re-training.

Acknowledgements

This work is supported by ANR grants ANR-16-CE23-0006 and ANR-11-LABX-0025-01.

Appendix A Supplementary material

We provide further results on CIFAR100 in order to show the importance of all components of our proposed CNMMs. Moreover, we provide additional qualitative results of semantic segmentation on the CityScapes dataset.

A.1 Ablative study of CNMM

Using sampling during training. During learning, CNMMs generate a set of samples $\tilde{\pi}^{s_{t}}_{{t-1}}$ using Eq. (7). In contrast, during inference we use the expectations $\pi^{s_{t}}_{{t-1}}$ instead. In order to evaluate the importance of sampling during learning, we have optimized a CNMM by using the aforementioned expectations instead of samples. Figure 9 shows the results obtained by the model using this approach, denoted as “Training with expectations”. We observe that, compared to the CNMM using sampling, the accuracy decreases faster as the pruning ratio increases. We attribute this to the fact that our sampling procedure can be regarded as a continuous relaxation of dropout, where a subset of the functions $f_{t-1}^{s_{t}}(\mathbf{h}_{t-1}^{t-1})$ is randomly removed when computing the output tensor $\mathbf{h}_{t-1}^{s_{t}}$. As a consequence, the learned model is more robust to the pruning process, where some of the convolutional blocks are removed during inference. This is not the case when deterministic expectations are used in Eq. (7) rather than samples.
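The link between sampling and dropout can be illustrated with a relaxed Bernoulli sample. Eq. (7) is not reproduced in this section, so the sketch below uses the standard binary Concrete (relaxed Bernoulli) formula as a stand-in; it is an assumption that this matches the exact parameterization used in the model, and the function names are hypothetical. At low temperature the samples are nearly binary, as in dropout, while higher temperatures such as the value 2 used in the experiments give softer weights.

```python
import math
import random


def _sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def relaxed_bernoulli_sample(prob: float, temperature: float,
                             rng: random.Random) -> float:
    """Draw one binary Concrete sample in [0, 1] with mixing probability `prob`."""
    eps = 1e-12
    u = min(max(rng.random(), eps), 1.0 - eps)     # avoid log(0)
    logistic_noise = math.log(u) - math.log(1.0 - u)
    logit = math.log(prob) - math.log(1.0 - prob)
    return _sigmoid((logit + logistic_noise) / temperature)


if __name__ == "__main__":
    rng = random.Random(0)
    for temperature in (0.05, 2.0):
        samples = [relaxed_bernoulli_sample(0.7, temperature, rng)
                   for _ in range(10000)]
        near_binary = sum(s < 0.05 or s > 0.95 for s in samples) / len(samples)
        print(f"temperature={temperature}: "
              f"fraction of near-binary samples = {near_binary:.2f}")
```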

Comparison with a deterministic model. We compare the performance of our CNMM with a deterministic variant using the same architecture. Concretely, in Eq. (7) we ignore the samples $\tilde{\pi}^{s_{t}}_{{t-1}}$ and simply sum the feature maps $\tilde{\mathbf{h}}^{s_{t}}_{t-1}$ and $\tilde{\mathbf{h}}^{t-1}_{t-1}$. Note that the resulting model is analogous to an MSDNet [17] using early-exit classifiers. We report the results in Figure 9, denoted as “Deterministic with early-exits”. We observe that our CNMM model obtains better performance than its deterministic counterpart. Moreover, as with MSDNets, accelerating the deterministic model is only possible by using the early exits. In contrast, the complementary pruning algorithm available in CNMM allows finer-grained control of the computational cost.

Expectation approximation during inference. In order to validate our approximation of $p(y|\mathbf{X};\theta)=\mathbb{E}_{p(\mathbf{H}_{T}|\mathbf{X})}[p(y|\mathbf{H}_{T};\theta)]$ during inference, we evaluate the performance obtained by using a Monte-Carlo procedure for the same purpose. In particular, we generate $N$ samples from the output distribution $p(\mathbf{H}_{T}|\mathbf{H}_{0})$. Then, we compute the class probabilities $p(y|\mathbf{H}_{T};\theta)$ for each sample and average them. Table 1 shows the results obtained by varying the number of samples. We observe that our approach offers performance similar to that of the Monte-Carlo approximation using $N=5$. For a higher number of samples, we observe slight improvements in the results. However, note that a Monte-Carlo approximation is very inefficient since it requires $N$ independent evaluations of the model.

In particular, the last row in Table 1 is 30 times more costly to obtain than the first two rows. The minimal gain obtained with more samples could probably be obtained more efficiently by using a larger model.
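The cost difference between the two inference procedures can be made concrete with a toy example. Everything in this sketch is hypothetical: the one-step `toy_model` that blends two feature vectors, the use of hard Bernoulli draws for the mixture weight, and the feature values are chosen for illustration and do not reproduce the CNMM forward pass. The point is only that plugging in the expected weight needs a single evaluation, whereas the Monte-Carlo average needs $N$.

```python
import math
import random
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_model(weight: float, f_a: List[float], f_b: List[float]) -> List[float]:
    """Hypothetical one-step mixture: blend two feature vectors, then classify."""
    logits = [weight * a + (1.0 - weight) * b for a, b in zip(f_a, f_b)]
    return softmax(logits)


def predict_with_expectation(pi: float, f_a: List[float],
                             f_b: List[float]) -> Tuple[List[float], int]:
    """Plug the expected mixture weight into a single forward pass."""
    return toy_model(pi, f_a, f_b), 1


def predict_with_monte_carlo(pi: float, f_a: List[float], f_b: List[float],
                             n_samples: int,
                             rng: random.Random) -> Tuple[List[float], int]:
    """Average class probabilities over n_samples sampled forward passes."""
    totals = [0.0] * len(f_a)
    for _ in range(n_samples):
        weight = 1.0 if rng.random() < pi else 0.0   # hard sample of the weight
        probs = toy_model(weight, f_a, f_b)
        totals = [t + p for t, p in zip(totals, probs)]
    return [t / n_samples for t in totals], n_samples


if __name__ == "__main__":
    rng = random.Random(0)
    f_a, f_b = [2.0, 0.5, -1.0], [0.0, 1.5, 0.5]     # made-up features
    probs, cost = predict_with_expectation(0.7, f_a, f_b)
    print("expectation:", [round(p, 3) for p in probs], "evaluations:", cost)
    for n in (1, 5, 30):
        probs, cost = predict_with_monte_carlo(0.7, f_a, f_b, n, rng)
        print(f"N={n:2d}:", [round(p, 3) for p in probs], "evaluations:", cost)
```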

A.2 Additional Qualitative Results

In Figure 10 we provide additional qualitative results for semantic segmentation obtained by a single trained CNMM model, using various operating points with different numbers of FLOPs during inference.

Bibliography

[1] Karim Ahmed and Lorenzo Torresani. Maskconnect: Connectivity learning by gradient descent. In ECCV, 2018.
[2] Pierre Baldi and Peter J Sadowski. Understanding dropout. In NeurIPS, 2013.
[3] Leo Breiman. Bagging predictors. Machine learning, 1996.
[4] Guobin Chen, Wongun Choi, Xiang Yu, Tony Han, and Manmohan Chandraker. Learning efficient object detection models with knowledge distillation. In NeurIPS, 2017.
[5] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. PAMI, 2018.
[6] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
[7] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. ICLR, 2019.
[8] Anuvabh Dutt, Denis Pellerin, and Georges Quénot. Coupled ensembles of neural networks. In 2018 International Conference on Content-Based Multimedia Indexing (CBMI), 2018.