Trading-off Accuracy and Energy of Deep Inference on Embedded Systems: A Co-Design Approach

Nitthilan Kannappan Jayakodi; Anwesha Chatterjee; Wonje Choi; Janardhan Rao Doppa; Partha Pratim Pande

arXiv:1901.10584·cs.CV·January 31, 2019


TL;DR

This paper introduces a co-design approach with Coarse-to-Fine Networks (C2F Nets) that dynamically adjusts classifier complexity during inference to optimize energy consumption without sacrificing accuracy on embedded systems.

Contribution

It presents a formalism and optimization algorithm for configuring C2F Nets, enabling adaptive classifier selection based on input difficulty to balance accuracy and energy efficiency.

Findings

1. Reduced Energy Delay Product by 27-60%
2. No accuracy loss compared to baseline
3. Effective on multiple real-world image tasks

Abstract

Deep neural networks have seen tremendous success for different modalities of data including images, videos, and speech. This success has led to their deployment in mobile and embedded systems for real-time applications. However, making repeated inferences using deep networks on embedded systems poses significant challenges due to constrained resources (e.g., energy and computing power). To address these challenges, we develop a principled co-design approach. Building on prior work, we develop a formalism referred to as Coarse-to-Fine Networks (C2F Nets) that allow us to employ classifiers of varying complexity to make predictions. We propose a principled optimization algorithm to automatically configure C2F Nets for a specified trade-off between accuracy and energy consumption for inference. The key idea is to select a classifier on-the-fly whose complexity is proportional to the…

Equations

$(\alpha^{*},\beta^{*},\gamma^{*})=\arg\min_{\alpha,\beta,\gamma}\mathcal{O}(\mathcal{NN},\mathcal{H},\lambda)$


Full text


Nitthilan Kannappan Jayakodi, Anwesha Chatterjee, Wonje Choi, Janardhan Rao Doppa, Partha Pratim Pande

This work was supported in part by the US National Science Foundation (NSF) grants CNS-1564014 and CCF 1514269 and USA Army Research Office grant W911NF-17-1-04

Nitthilan K. J, Anwesha Chatterjee, Wonje Choi, J. R. Doppa, and P. P. Pande are with the School of Electrical Engineering and Computer Engineering, Washington State University, Pullman, WA 99163 USA (e-mail: [n.kannappanjayakodi, anwesha.chatterjee, wonje.choi, jana.doppa, pande]@wsu.edu).

This article was presented in the International Conference on Hardware/Software Codesign and System Synthesis (CODES + ISSS) and appears as part of the ESWEEK-TCAD special issue.

Abstract

Deep neural networks have seen tremendous success for different modalities of data including images, videos, and speech. This success has led to their deployment in mobile and embedded systems for real-time applications. However, making repeated inferences using deep networks on embedded systems poses significant challenges due to constrained resources (e.g., energy and computing power). To address these challenges, we develop a principled co-design approach. Building on prior work, we develop a formalism referred to as Coarse-to-Fine Networks (C2F Nets) that allows us to employ classifiers of varying complexity to make predictions. We propose a principled optimization algorithm to automatically configure C2F Nets for a specified trade-off between accuracy and energy consumption for inference. The key idea is to select a classifier on-the-fly whose complexity is proportional to the hardness of the input example: simple classifiers for easy inputs and complex classifiers for hard inputs. We perform a comprehensive experimental evaluation using four different C2F Net architectures on multiple real-world image classification tasks. Our results show that an optimized C2F Net can reduce the Energy Delay Product (EDP) by 27 to 60 percent with no loss in accuracy when compared to the baseline solution, where all predictions are made using the most complex classifier in the C2F Net.

Index Terms:

Deep neural networks, Inference, Embedded systems, Approximate computing, Bayesian optimization, Hardware and software co-design

I Introduction

We are witnessing the rise of data-driven systems, where real-time predictions and decisions are being made based on models learned from large-scale data (e.g., text, images, speech, and sensor data). Deep learning, a set of computational techniques to automatically extract patterns and useful features from raw data, has played a major role in this data revolution. Success stories of deep neural networks (DNNs) include achieving very high accuracy in image classification, speech recognition, and machine translation [1]; deep reinforcement learning [2]; and computers playing the game of Go against the best human players (i.e., AlphaGo) [3].

In spite of the above successes, there are significant challenges in deploying trained DNNs to make predictions on edge devices (e.g., mobiles, Internet of Things, and wearables) due to their constrained resources, including energy and computing power. In practice, high accuracy is achieved by employing large DNNs, where making inferences or predictions is computationally expensive and consumes a lot of energy. Unfortunately, this is not compatible with edge applications (e.g., robotics, smart health, and surveillance systems). As we discuss in the related work section, most prior work has addressed these challenges using two different approaches from the hardware and software perspectives: 1) designing high-performance and energy-efficient hardware accelerators for performing inference, and 2) compressing large-scale DNNs with negligible loss in accuracy. In both of these approaches, the inference for every input example is made using a “fixed” computational process. They do not exploit the fact that the hardness of inference varies from one example to another.

In this paper, we develop a co-design approach to overcome the above drawback by performing “adaptive computing”. We build on the general principle of coarse-to-fine computation [4] and formalize a model that we refer to as Coarse-to-Fine Networks (C2F Nets), which is a generalization of the recently proposed conditional deep learning (CDL) model [5]. C2F Nets allow us to employ classifiers of varying complexity depending on the hardness of input examples. The key idea is to learn to select simpler (coarser) networks to classify “easy” examples and complex (finer) networks to classify “hard” examples. Figure 1 illustrates easy and hard inputs for an image classification task. We also provide design principles to guide the design of good C2F Nets for a given real-world classification task. We propose a novel and principled stage-wise optimization approach to automatically configure C2F Nets for a specified trade-off between accuracy and energy consumption of inference on a target hardware platform. Our overall co-design approach using C2F Nets is complementary to prior approaches based on hardware and software optimization, including the recent work on CDL [5].

We performed comprehensive experiments using four different C2F Nets for image classification tasks. We optimized the C2F Nets for different trade-offs between accuracy and energy consumption of inference on a target hardware platform. Our results show that many input examples can be classified correctly using simple networks and complex networks are used only for hard examples. Our optimized C2F Net reduces the Energy Delay Product (EDP) by 27 to 60 percent with no loss in accuracy when compared to the baseline solution, where all predictions are made using the most complex network in C2F Nets.

Contributions. We make three main contributions in this paper.

1. Formalize the coarse-to-fine network (C2F Net) model by generalizing the recent work on CDL. This model allows us to perform adaptive computing to make inferences on input examples of varying hardness. We also provide a methodology, based on fine-grained energy analysis, to guide practitioners in the design of C2F networks.

2. Develop a principled optimization algorithm to automatically configure C2F Nets for a specified trade-off between accuracy and energy on a target hardware platform.

3. Comprehensive experimental evaluation of four C2F Net architectures on real-world image classification to demonstrate the effectiveness of the overall approach.

II Related Work

Main challenges for deploying deep learning applications on embedded systems include high resource requirements in terms of memory, computation, and energy. Prior work has addressed these challenges using methods based on hardware, software, and hardware/software co-design [6].

Hardware Approaches. Architecture level methods include design of specialized hardware targeting DNN computations using the data-flow knowledge [7], placement of memory closer to computation units [8], and on-chip integration of memory and computation [9]. Apart from the constraint of having an application-specific hardware system, most of these hardware technologies also have an overhead of analog to digital conversion [6].

Software Approaches. To address the challenges on general-purpose CPUs and GPUs, the following software-level approaches have been studied. One approach reduces the precision of weights and activations [10]. Fixed-point representations of weights and activations that vary dynamically across layers are also employed to reduce the energy and area cost of the network [11]. In [12], the authors show that using floating-point for weights and fixed-point for activations reduces power consumption by 50% and model size by 36%. Another software-level approach is pruning and compression of large-scale DNNs [13]. The sparsity in features and activations after non-linear operations such as ReLU is exploited to compress the models in [7]. To address the significant redundancy in weights when training the network, the authors in [14] prune weights based on their magnitudes. While magnitude-based pruning of weights reduces the computation cost, it does not directly address the energy cost. To make the model more energy-efficient, an “energy-aware” pruning mechanism is proposed in [15], where weights are pruned layer by layer based on knowledge of the energy-hungry layers. This method is shown to be 1.74 times more energy-efficient than weight-magnitude-based pruning.

A finer-level software approach is to improve the network architecture. The inception module [16] performs a 2D convolution as two 1D convolutions by treating the 2D filters as completely separable. The Xception [17] and MobileNets [18] architectures also employ depth-wise separable convolutions similar to the inception model, but perform an additional point-wise convolution, i.e., a 1x1 convolution, at the end to combine the outputs of the depth-wise convolutions. This method helps reduce the computation and model size.

A recent paper [19] proposed an iterative CNN (ICNN) approach to make multi-stage predictions for images. The key idea is to apply a two-stage wavelet transform to produce multiple small-scale images and train separate models to process them progressively to make multi-stage predictions. Compared to the ICNN approach, our C2F Nets approach has the following advantages: 1) C2F Nets are a more principled and general solution, and the framework can easily be instantiated for different DNN architectures; and 2) our optimization approach is very efficient and allows us to automatically tune C2F Nets for different energy-accuracy trade-offs on a target hardware platform.

Hardware and Software Co-Design Approaches. Many of the above approaches involve the design of hardware guided by software-level optimization techniques. The work in [11] proposes an FPGA-specific compiler that analyzes the structure and parameters of convolutional neural networks and generates modules to improve the throughput of the system. In [20], the authors propose the design of the Efficient Inference Engine (EIE), which deploys pruned DNN models. In [21], the knowledge of compressed sparse weights guides the hardware to read weights and perform multiply-and-accumulate (MAC) computations only for the nonzero activations, thereby reducing the energy cost by 45%.

Coarse-to-Fine and Conditional Inference Approaches. Coarse-to-Fine or cascaded computation is a general principle that has a long history in computer science and machine learning [4, 22]. The application of this general principle to a given problem is non-trivial, and the exact details and algorithmic procedures vary from one application to another including our co-design approach for inference with deep neural networks.

The cascaded CNN architecture for face detection [23] progressively prunes areas of the image that are not likely to contain a face. The key differences with our work are as follows: 1) In our C2F architecture, the features computed in the previous level are reused in the next level to construct more complex features, whereas the CNNs used in [23] do not reuse any features; and 2) The approach proposed in [23] does not allow the cascade to be optimized for a specified trade-off between speed and accuracy of face detection.

Prior work on coarse-to-fine deep kernel networks [24] combines multiple input kernels recursively to obtain a complex kernel and an associated representation of the data. That work makes very strict assumptions and incorporates them directly into the overall learning approach: 1) It works only for binary classification tasks; 2) It assumes that the number of negative examples is significantly higher than the number of positive examples. The learning approach in [24] for training coarser networks is tuned to binary classification and assigns very high costs to misclassifying positive examples to incorporate this assumption; and 3) There are no thresholds to configure the architecture to trade off speed and accuracy of prediction. Similarly, there is no methodology to optimize the architecture for a specified trade-off.

The high-level architecture of branch CNN (BCNN) [25] is similar in style to our C2F model, but there are some significant differences: 1) BCNN uses a linear classifier after each layer, whereas we do not; and 2) The approach in [25] does not allow the BCNN to be configured for a specified trade-off between speed and accuracy. This method is also ad hoc, as there is no well-defined optimization formulation or optimization methodology.

The recent paper on the conditional deep learning (CDL) model [5] is the closest related work to ours. However, there are some significant differences between our work and this paper: 1) The CDL model employs a single threshold value at all levels, whereas we employ different thresholds for different levels. When the number of levels is large, different thresholds for different levels achieve better Pareto-optimal solutions. As discussed later, our experimental results on CIFAR10 and CIFAR100 clearly show this difference; 2) In their training approach, each level is trained using only the subset of input examples that are not filtered by the previous level. This will likely cause over-fitting and may not work well on harder tasks. We train all the intermediate classifiers on all the training examples to improve their generalization. Additionally, the approach in [5] needs to train all the different levels for each confidence threshold, which is very time-consuming. In our case, we train the classifiers at different levels only once and tune the thresholds for different levels jointly, conditioned on the trained features and classifiers; and 3) We look at the co-design problem of configuring a software application to run on a specific hardware architecture for a specified trade-off between accuracy and energy. Our optimization approach, which is based on Bayesian Optimization principles to configure the C2F Net to trade off accuracy and energy, is significantly different and allows us to achieve better Pareto-optimal solutions. Our experimental evaluation is much more comprehensive (CIFAR10, CIFAR100, and MNIST datasets with different C2F Nets). We perform an in-depth energy analysis of different layers to provide principles for designing C2F Nets, as well as a fine-grained analysis to understand why the approach works better.

III Coarse-to-Fine Networks (C2F Nets)

In this section, we describe the details of our coarse-to-fine networks (C2F Nets) model. In what follows, we provide the motivation, formal model of C2F Nets along with some examples, and how to make predictions given a fully specified C2F Net model.

III-A Motivation

Simple (shallow) DNNs are energy-efficient, but their accuracy is low. On the other hand, many state-of-the-art DNNs that achieve very high accuracy are highly complex (deep) and consume a significant amount of energy. We are motivated by the fact that many input examples are easy and can be classified correctly using simple DNNs, and only a small fraction of hard inputs would need complex DNNs. We illustrate this observation using two examples. First, Figure 1 shows bird images corresponding to easy and hard cases. Second, Figure 8 shows the accuracy of different DNNs with varying complexity. We can see that the accuracy improvement is small when we move from simple to complex DNNs, whereas their energy consumption grows significantly. This corroborates our hypothesis. Therefore, our goal is to automatically select simple networks for easy inputs and complex networks for hard inputs. This mechanism will allow us to achieve high accuracy with significantly less energy consumption when compared to the baseline approach, where we make predictions for all input examples using the most complex DNN.

III-B C2F Nets Model

A C2F Net model $\mathcal{NN}$ with $T$ levels can be seen as a sequence of networks stacked in the order of increasing complexity as shown in Figure 2. Each level $i$ of the model contains three key elements.

1) Feature transformation function in the form of a network of layers that takes the features computed in previous level $x_{i-1}$ as input and produces more complex features $x_{i}$ . Parameters of the feature function are denoted by $\alpha_{i}$ . Figure 2 and Figure 7 provide illustrative examples for feature transformation functions.

2) Classifier takes the features computed by feature transformation function $x_{i}$ and produces a probability distribution over all candidate labels. Parameters of the classifier are denoted by $\beta_{i}$ . Figure 2 provides an illustration of the classifier block.

3) Confidence threshold. Given a predicted probability distribution over classification labels, we can estimate the confidence of the classifier in a variety of ways [26] including: a) Maximum probability, where we take the probability of the highest score label; b) Entropy over the probability of all candidate labels; c) Margin denoted by the difference in probability scores between best and second-best label. For a given confidence type, the confidence threshold $\gamma_{i}$ is employed to determine if the classifier is confident enough in its prediction to terminate and return the predicted label.
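The three confidence types listed above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, and the entropy is negated here (an assumption) so that, like the other two measures, a larger value means a more confident classifier.

```python
import math

def max_probability(probs):
    """Confidence as the probability of the highest-scoring label."""
    return max(probs)

def negative_entropy(probs):
    """Confidence as negated Shannon entropy (assumption: negated so that
    peaked distributions score higher than flat ones)."""
    return sum(p * math.log(p) for p in probs if p > 0.0)

def margin(probs):
    """Confidence as the gap between the best and second-best label."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second

# Toy distributions over three labels (illustrative values only).
easy = [0.90, 0.05, 0.05]   # peaked: the classifier is confident
hard = [0.40, 0.35, 0.25]   # flat: the classifier is unsure
```

With any of the three measures, the `easy` distribution scores higher than the `hard` one, so a threshold $\gamma_{i}$ placed between the two scores would let the easy input exit at level $i$ and send the hard input to the next level.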

III-C Inference in C2F Nets

Given a C2F Net with all its parameters ( $\alpha$ , $\beta$ , and $\gamma$ ) fully specified, inference computation for a new input example $x$ is performed as follows (see Algorithm 1). We sequentially go through the C2F Net starting from level 1. At each level $i$ , we apply the feature transformation function parameterized by $\alpha_{i}$ and the intermediate classifier parameterized by $\beta_{i}$ to compute the probability distribution over classification labels $\hat{P_{i}}(y)$ . We estimate the confidence $t_{i}$ from $\hat{P_{i}}(y)$ and take the predicted output $\hat{y}$ to be the label with the highest probability score. If the estimated confidence $t_{i}$ meets the threshold $\gamma_{i}$ or we reach the final level $T$ , we terminate and return the predicted output $\hat{y}$ for the given input $x$ .
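The inference procedure described above can be sketched as follows. This is a minimal sketch under stated assumptions: each level is represented as a hypothetical pair of Python callables (a feature transformer standing in for $\alpha_{i}$ and a classifier standing in for $\beta_{i}$), "meets the threshold" is read as $t_{i}\geq\gamma_{i}$, and maximum probability is the default confidence type. The toy levels at the bottom are illustrative and are not trained networks.

```python
def c2f_predict(x, levels, thresholds, confidence=max):
    """Return (predicted label, exit level) for input x.

    levels:     list of (transform, classify) pairs, coarse to fine
    thresholds: gamma_1 .. gamma_{T-1}, one per non-final level
    confidence: maps a probability list to a scalar t_i
    """
    assert len(thresholds) == len(levels) - 1
    features = x
    for i, (transform, classify) in enumerate(levels):
        features = transform(features)      # reuse features of the previous level
        probs = classify(features)          # \hat{P}_i(y)
        y_hat = max(range(len(probs)), key=probs.__getitem__)
        is_last = i == len(levels) - 1
        if is_last or confidence(probs) >= thresholds[i]:
            return y_hat, i + 1             # levels are numbered from 1

# Hypothetical two-level net: the coarse classifier reads its confidence
# directly from the input, the fine classifier always predicts label 1.
toy_levels = [
    (lambda f: f, lambda f: [f[0], 1.0 - f[0]]),
    (lambda f: f, lambda f: [0.1, 0.9]),
]
easy_result = c2f_predict([0.95], toy_levels, thresholds=[0.8])
hard_result = c2f_predict([0.55], toy_levels, thresholds=[0.8])
```

In this toy run the first input exits at level 1 because its confidence of 0.95 clears the threshold of 0.8, while the second input falls through to level 2.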

IV Optimization Approach for C2F Nets

In this section, we describe a principled optimization algorithm to automatically configure C2F Nets for a specified trade-off between accuracy and energy on a target hardware platform.

Problem Setup. Without loss of generality, let us assume that we are given a C2F Net $\mathcal{NN}$ with $T$ levels. The parameters of $\mathcal{NN}$ can be divided into three parts: 1) Parameters of feature transformation functions at different levels $\alpha=(\alpha_{1},\alpha_{2},\cdots,\alpha_{T})$ ; 2) Parameters of intermediate classifiers at different levels $\beta=(\beta_{1},\beta_{2},\cdots,\beta_{T})$ ; and 3) Confidence thresholds at different levels $\gamma=(\gamma_{1},\gamma_{2},\cdots,\gamma_{T-1})$ that decide the complexity of the classifier employed for a given input example. We are provided with a training set $D_{train}$ and a validation set $D_{val}$ of classification examples drawn from an unknown target distribution $\mathcal{D}$ , where each classification example is of the form $(x,y^{*})$ , $x$ is the input (e.g., image), and $y^{*}$ is the output class (e.g., image label). We are also given a target hardware platform $\mathcal{H}$ for which the C2F model $\mathcal{NN}$ needs to be optimized. We can measure the energy consumption and accuracy of inference of a candidate $\mathcal{NN}$ model using the hardware $\mathcal{H}$ . Suppose $\mathcal{O}(\mathcal{NN},\mathcal{H},\lambda)=\mathbb{E}_{(x,y^{*})\sim\mathcal{D}}\left[\lambda\cdot\text{Error}(\mathcal{NN},\mathcal{H})+(1-\lambda)\cdot\text{Energy}(\mathcal{NN},\mathcal{H})\right]$ stands for the expected error and energy trade-off achieved over the target distribution of classification examples $\mathcal{D}$ when the coarse-to-fine network model $\mathcal{NN}$ is executed with parameters $\alpha$ , $\beta$ , $\gamma$ on the target hardware platform $\mathcal{H}$ . For a specified trade-off parameter $\lambda\in\left[0,1\right]$ , our goal is to find the best parameters $\alpha$ , $\beta$ , $\gamma$ of the coarse-to-fine network $\mathcal{NN}$ to achieve the specified trade-off between error and energy for the target platform $\mathcal{H}$ .

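$$(\alpha^{*},\beta^{*},\gamma^{*})=\arg\min_{\alpha,\beta,\gamma}\ \mathcal{O}(\mathcal{NN},\mathcal{H},\lambda)\qquad(1)$$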

Since we do not have access to the distribution $\mathcal{D}$ , we employ the training and validation set to find the best parameter values. For example, to evaluate a candidate parameter configuration over the validation set, we do the following. We compute the normalized error and normalized energy with respect to the setting where all predictions are made using the largest network over all the input examples in the validation set. We plug the normalized error and energy in the objective $\mathcal{O}$ to evaluate the candidate parameter configuration.
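The validation-set evaluation just described can be sketched as below. The sketch assumes that "normalized" means dividing by the corresponding value of the largest-network baseline, which the text does not spell out. The function name, the record layout, and the numbers are hypothetical.

```python
def c2f_objective(records, baseline_error, baseline_energy, lam):
    """Empirical objective O on a validation set.

    records:         list of (is_correct, energy_joules), one per example,
                     obtained by running the candidate C2F Net on the hardware
    baseline_error:  error rate when every example uses the largest network
    baseline_energy: mean energy per example for that same baseline
    lam:             trade-off parameter lambda in [0, 1]
    Assumes both baseline values are strictly positive.
    """
    n = len(records)
    error = sum(1 for correct, _ in records if not correct) / n
    energy = sum(joules for _, joules in records) / n
    norm_error = error / baseline_error      # assumption: ratio to baseline
    norm_energy = energy / baseline_energy   # assumption: ratio to baseline
    return lam * norm_error + (1.0 - lam) * norm_energy

# Illustrative numbers: three cheap early exits and one full-depth example.
val_records = [(True, 1.0), (True, 1.0), (False, 4.0), (True, 2.0)]
score = c2f_objective(val_records, baseline_error=0.25,
                      baseline_energy=4.0, lam=0.5)
```

Under this normalization the baseline itself scores exactly 1 for every $\lambda$, so a candidate configuration with a score below 1 improves on the baseline for the chosen trade-off.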

IV-A Stagewise Optimization Algorithm

The optimization problem posed in Equation 1 is extremely challenging to solve due to the complex interactions among the parameter values ( $\alpha$ , $\beta$ , $\gamma$ ) in the optimization objective $\mathcal{O}(\mathcal{NN},\mathcal{H},\lambda)$ . Generally, the C2F model $\mathcal{NN}$ would be trained using back-propagation via stochastic gradient descent (SGD) optimization [27]. However, the objective requires the energy consumption of the coarse-to-fine model $\mathcal{NN}$ on the target hardware platform $\mathcal{H}$ where it is executed. Applying the standard back-propagation training algorithm would require us to estimate the energy at every iteration of training, which would make the training procedure impractical.

To overcome these challenges, we leverage the structure in this optimization problem to develop an efficient stagewise optimization algorithm (see Algorithm 2), where the parameters $\alpha$ , $\beta$ , and $\gamma$ are optimized sequentially one after another. By splitting it into multiple stages, we decouple the dependency of the training procedure for $\alpha$ and $\beta$ parameters from the energy measurement step. First, we learn the parameters $\alpha$ to optimize the feature transformers at different levels. Second, conditioned on the found $\alpha$ , we learn the parameters $\beta$ to optimize the intermediate classifiers at different levels. In both the stages, we optimize only for the prediction accuracy independent of the energy. Third, conditioned on the found $\alpha$ and $\beta$ , we find the best values of confidence thresholds $\gamma$ to optimize the objective $\mathcal{O}$ . We describe the details of these three optimization procedures in the following subsections.

IV-B Optimizing Feature Transformers

To obtain the parameters $\alpha$ corresponding to feature transformation functions at different levels, we train the finest (most complex) C2F Net as follows. We minimize the cross-entropy loss over training data $D_{train}$ using an SGD training procedure. Cross-entropy loss (aka log loss) measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label. SGD is a stochastic approximation of gradient descent optimization and is an iterative method for minimizing the cross-entropy loss via back-propagation.
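For reference, the loss and the update can be written out in their standard textbook form, using the paper's symbols where they exist. The learning rate $\eta$ and the per-example loss symbol $\ell$ are notational assumptions introduced here.

$$\ell(x,y^{*})=-\log\hat{P}(y^{*}),\qquad \mathcal{L}(\alpha)=\frac{1}{|D_{train}|}\sum_{(x,y^{*})\in D_{train}}\ell(x,y^{*}),\qquad \alpha\leftarrow\alpha-\eta\,\nabla_{\alpha}\,\ell(x,y^{*})$$

As a worked example, a prediction that assigns probability 0.9 to the correct label incurs a loss of $-\log 0.9\approx 0.105$, whereas one that assigns it only 0.1 incurs $-\log 0.1\approx 2.303$. This is the sense in which the loss grows as the predicted probability diverges from the actual label.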

IV-C Optimizing Intermediate Classifiers

Our approach to optimize the parameters of intermediate classifiers (i.e., $\beta$ ) is inspired by the transfer learning approach employed in deep learning. The key idea is to leverage the pre-trained model for a source task to quickly learn a model for a relevant target task. In our case, the source task corresponds to learning parameters $\alpha$ for the finest classifier as described above and the target task corresponds to learning parameters $\beta_{i}$ for intermediate classifier at level $i\in\left\{1,2,\cdots,T\right\}$ . Intuitively, we want to minimize the energy consumption as much as possible by making correct classification decisions with high confidence using simple (coarser) classifiers. Therefore, we fix the parameters $\alpha_{1},\alpha_{2},\cdots,\alpha_{i}$ for feature transformation functions and learn the parameters $\beta_{i}$ by minimizing the cross-entropy loss over the training data $D_{train}$ .
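The idea of fixing the feature transformers and fitting only the classifier at level $i$ can be illustrated with a small sketch. It is a deliberate simplification and not the authors' setup: the intermediate classifier is reduced to a single linear soft-max layer, the frozen features are given as plain lists of numbers, and every name and hyperparameter below is hypothetical.

```python
import math
import random

def softmax(scores):
    """Numerically stable soft-max over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(weights, bias, x):
    """Class probabilities of a linear soft-max classifier."""
    return softmax([sum(w * v for w, v in zip(wk, x)) + b
                    for wk, b in zip(weights, bias)])

def train_classifier(features, labels, num_classes, epochs=200, lr=0.5, seed=0):
    """Fit beta_i by SGD on cross-entropy; the features stay fixed."""
    rng = random.Random(seed)
    dim = len(features[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    bias = [0.0] * num_classes
    order = list(range(len(features)))
    for _ in range(epochs):
        rng.shuffle(order)
        for n in order:
            x, y = features[n], labels[n]
            probs = predict_proba(weights, bias, x)
            for k in range(num_classes):
                # Gradient of cross-entropy with respect to the k-th score.
                grad = probs[k] - (1.0 if k == y else 0.0)
                for d in range(dim):
                    weights[k][d] -= lr * grad * x[d]
                bias[k] -= lr * grad
    return weights, bias

# Toy "frozen features" for four training examples and their labels.
frozen_features = [[0.0, 1.0], [0.2, 0.9], [1.0, 0.1], [0.9, 0.0]]
train_labels = [0, 0, 1, 1]
beta_w, beta_b = train_classifier(frozen_features, train_labels, num_classes=2)
```

Because only the classifier parameters are updated, the same frozen features can be reused to fit the classifier at every level without retraining the feature transformers.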

IV-D Optimizing Confidence Thresholds

In this section, we describe our approach for finding the best thresholds $\gamma$ =( $\gamma_{1},\gamma_{2},\cdots,\gamma_{T-1}$ ) for the specified trade-off between accuracy and energy consumption of inference with C2F networks.

Problem Formulation. Suppose the domain of each confidence threshold $\gamma_{i}$ at each level $i\in\left\{1,2,\cdots,T-1\right\}$ is $\left[lb,ub\right]$ . For example, if we employ the probability of the highest-scoring label as our confidence parameter, then $lb=0$ and $ub=1$ . Let $\mathcal{G}$ denote the joint space of candidate confidence threshold assignments, where $g\in\mathcal{G}$ is a candidate assignment to all the $T-1$ threshold parameters $\gamma_{1},\gamma_{2},\cdots,\gamma_{T-1}$ . We can evaluate the objective $\mathcal{O}(\mathcal{NN},\mathcal{H},\lambda)=\mathbb{E}_{(x,y^{*})\sim\mathcal{D}}\left[\lambda\cdot\text{Error}(\mathcal{NN},\mathcal{H})+(1-\lambda)\cdot\text{Energy}(\mathcal{NN},\mathcal{H})\right]$ for any $g\in\mathcal{G}$ by running the corresponding C2F Net model $\mathcal{NN}$ on the validation set of classification examples $D_{val}$ (i.e., the expectation is approximated with a finite sample). Our goal is to find the best confidence thresholds $g^{*}\in\mathcal{G}$ , i.e., those that lead to the lowest objective value $\mathcal{O}(g)$ , since the objective is a weighted sum of error and energy.

Bayesian Optimization Approach. We propose to solve the above problem using the Bayesian Optimization (BO) framework [28] that is known to be very efficient in solving global optimization problems using black-box evaluations of the objective function (Figure 3). BO algorithms can be seen as sequential decision-making processes that select the next candidate input to be evaluated to quickly direct the search towards optimal inputs by trading-off exploration and exploitation at each search step. By iteratively selecting a candidate input for evaluation and learning a statistical model based on the observed input and output pairs, BO approach can quickly move towards high-quality inputs; this significantly reduces the number of objective function evaluations during the optimization process.

In what follows, we describe the three key elements of the general BO framework as applicable for our problem:

1) A Statistical Model of the true function $\mathcal{O}(g)$ by placing a prior over the space of functions. Gaussian Process (GP) [29] is the most commonly used prior due to its generality and uncertainty quantification ability. A GP over a space $\mathcal{G}$ is a random process from $\mathcal{G}$ to $\mathbb{R}$ . It is characterized by a mean function $\mu:\mathcal{G}\rightarrow\mathbb{R}$ and a covariance or kernel function $\kappa:\mathcal{G}\times\mathcal{G}\rightarrow\mathbb{R}$ . Radial kernels are typically employed as prior covariance functions of GPs. Using a radial kernel means that the prior covariance can be written as $\kappa(g,g^{\prime})=\kappa_{0}\phi(\left\|g-g^{\prime}\right\|)$ and depends only on the distance between $g$ and $g^{\prime}$ . The scale parameter $\kappa_{0}$ captures the magnitude by which the function could deviate from $\mu$ . $\phi$ is a decreasing function with $\phi(0)=1$ . If the function $\mathcal{O}$ is sampled from $\mathcal{GP}(\mu,\kappa)$ , then $\mathcal{O}(g)$ is normally distributed as $\mathcal{N}(\mu(g),\kappa(g,g))$ for all $g\in\mathcal{G}$ .

2) An Acquisition Function Af to score the utility of evaluating a candidate input based on the current statistical model. Af needs to trade off exploration and exploitation based on the predictions of the statistical model. Exploitation corresponds to selecting candidate inputs for which the statistical model is highly confident (high mean), and exploration corresponds to selecting candidate inputs for which the statistical model is not confident (high variance). We employ the popular Upper Confidence Bound (UCB) rule as the acquisition function. The UCB function is defined as $UCB_{t+1}(g)=\mu_{t}(g)+\sqrt{\kappa_{t+1}}\,\sigma_{t}(g)$ , where $\mu_{t}$ and $\sigma_{t}$ are the posterior mean and standard deviation of the GP conditioned on the evaluations $\mathcal{D}_{t}$ observed up to iteration $t$ (Algorithm 5). $\mu_{t}$ encourages exploitation, $\sigma_{t}$ encourages exploration, and the scalar $\kappa_{t+1}$ (distinct from the kernel function $\kappa$ ) controls the trade-off.

3) An Optimization Procedure to select the best-scoring candidate input according to Af. We employ the popular DIvided RECTangles (DIRECT) approach to select the input $g\in\mathcal{G}$ to be queried in each iteration guided by the statistical model.

In each iteration, the BO algorithm calls the optimizer to select the next input $g\in\mathcal{G}$ at which to evaluate the objective $\mathcal{O}(g)$ , and updates the statistical model using the aggregate training data from past evaluations (see Algorithm 5).
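The three BO elements can be combined into a compact pure-Python sketch. Several simplifications are assumptions made here and are not taken from the paper: the GP has a zero prior mean and a squared-exponential radial kernel with a hand-picked length scale, the UCB weight is held constant, the acquisition function is maximized by scoring random candidates instead of running DIRECT, and the objective is minimized by applying UCB to its negation. The toy objective merely stands in for an evaluation of a C2F Net on the validation set.

```python
import math
import random

def rbf_kernel(a, b, scale=1.0, length=0.3):
    """Radial kernel kappa_0 * phi(||a - b||) with phi(0) = 1."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return scale * math.exp(-0.5 * sq_dist / length ** 2)

def cholesky(matrix):
    """Lower-triangular Cholesky factor of a positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(lower[i][k] * lower[j][k] for k in range(j))
            lower[i][j] = math.sqrt(max(s, 1e-12)) if i == j else s / lower[j][j]
    return lower

def cholesky_solve(lower, rhs):
    """Solve (L L^T) z = rhs by forward, then backward substitution."""
    n = len(rhs)
    y = [0.0] * n
    for i in range(n):
        y[i] = (rhs[i] - sum(lower[i][k] * y[k] for k in range(i))) / lower[i][i]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (y[i] - sum(lower[k][i] * z[k] for k in range(i + 1, n))) / lower[i][i]
    return z

def fit_gp(points, values, noise=1e-4):
    """Factor the kernel matrix once per BO iteration."""
    gram = [[rbf_kernel(a, b) + (noise if i == j else 0.0)
             for j, b in enumerate(points)] for i, a in enumerate(points)]
    lower = cholesky(gram)
    return lower, cholesky_solve(lower, values)

def gp_predict(points, lower, weights, query):
    """Posterior mean and standard deviation of the zero-mean GP at query."""
    k_star = [rbf_kernel(a, query) for a in points]
    mean = sum(k * w for k, w in zip(k_star, weights))
    var = rbf_kernel(query, query) - sum(
        k * v for k, v in zip(k_star, cholesky_solve(lower, k_star)))
    return mean, math.sqrt(max(var, 0.0))

def bayes_opt(objective, dim, iterations=15, candidates=100, kappa=4.0, seed=0):
    """Minimize objective over [0, 1]^dim via UCB on the negated objective."""
    rng = random.Random(seed)
    def sample():
        return [rng.random() for _ in range(dim)]
    points = [sample() for _ in range(3)]
    values = [-objective(g) for g in points]       # negate: UCB maximizes
    for _ in range(iterations):
        lower, weights = fit_gp(points, values)
        def ucb(g):
            mean, std = gp_predict(points, lower, weights, g)
            return mean + math.sqrt(kappa) * std
        # Random candidate search stands in for DIRECT (assumption).
        best = max((sample() for _ in range(candidates)), key=ucb)
        points.append(best)
        values.append(-objective(best))
    i = max(range(len(values)), key=values.__getitem__)
    return points[i], -values[i]

def toy_objective(g):
    """Stand-in for O(g); a real run would evaluate the C2F Net on D_val."""
    return (g[0] - 0.7) ** 2 + (g[1] - 0.3) ** 2

best_thresholds, best_value = bayes_opt(toy_objective, dim=2)
```

For a C2F Net with $T$ levels, `dim` would be $T-1$ and each objective call would correspond to one measured run on the target hardware, which is why keeping the number of evaluations small matters.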

V C2F Nets Design Methodology

In this section, we describe our methodology to design C2F networks from a given base network architecture (i.e., complex DNN). Although the methodology is generally applicable to different neural network architectures, we use convolutional neural networks for image classification as a running example for illustrative purposes.

V-A Hardware Setup

All our experiments were performed by deploying DNN models on an ODROID-XU4 board [30]. ODROID-XU4 is an octa-core heterogeneous multi-processing system with the ARM BigLittle architecture, which is very popular in current mobile devices. The BigLittle architecture consists of an asymmetrical multi-core system, where cores are clustered into two groups based on their power and performance modes: 1) the power-hungry (Big) mode; and 2) the slower, battery-saving (Little) mode. The ODROID-XU4 board employs an ARM Cortex-A15 quad-core 2GHz CPU as the Big cluster and a Cortex-A7 quad-core 1.4GHz CPU as the Little cluster. The board boots Ubuntu 16.04 LTS with ODROID’s Linux kernel version 3.10.y. We execute DNN models with Caffe-HRT [31], which employs the Caffe framework [32] with the ARM Compute Library to speed up deep learning computations. We employ SmartPower2 [33] to measure power, and we compute the average power over the total execution time.

V-B Fine-Grained Energy Analysis of Base DNN

Base deep neural network (DNN) architectures. We consider deep convolutional networks for solving image classification tasks. These architectures consist of different layers including Convolution, Fully Connected (FC) or Dense, Max Pooling, ReLU, Soft-Max, and Batch-Norm. For ease of understanding, we divide the network into a feature transformation block and a classifier block. The feature transformation block is the top part of the network and consists of convolution and max-pooling layers. It generates the input features for the classifier block. The classifier block is the bottom part of the network and consists of fully connected layers followed by a soft-max layer. It predicts the final classification label for the input image. Refer to Figure 2 for further details. We considered four base DNNs of differing complexity for our fine-grained energy analysis: Base Networks A and C contain six convolution layers, and Base Networks B and D contain ten convolution layers (a slight variant of the VGG network [34]). We present our analysis and results for network A, noting that the results are similar for all the other networks.

To understand the influence of a DNN architecture on the energy consumption of a real-world application, we analyzed the energy consumption per layer for all base networks to classify a single image. Our high-level energy analysis shows that convolution layers consume the most energy followed by fully connected layers. Figure 4 shows the distribution of energy consumed by convolution layers, fully connected layers, and all other layers for network A. In what follows, we perform fine-grained analysis for convolution and fully connected layers separately.

Energy analysis of convolution layers. The energy consumed by a convolution layer depends on the following elements: a) the size of the filter, b) image dimensions over which the filter is applied, and c) the number of filters. For all the base networks (A, B, C, and D), the number of convolution filters increases with the depth of the layer. For example, in base network A, the number of filters for increasing depths are 64, 128, and 192 respectively.
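The dependence on these three elements can be illustrated by counting multiply-accumulate (MAC) operations, a common rough proxy for the compute cost of a convolution layer. The sketch below is illustrative only: the layer layout (two convolution layers per filter count, a 2x2 max-pooling step after each pair, stride 1 with "same" padding, 32x32x3 input) is an assumption rather than the exact architecture of base network A, and MAC counts are not measured energy.

```python
def conv_macs(width, height, in_channels, num_filters, kernel=3):
    """MACs of a stride-1, 'same'-padded convolution layer (a cost proxy)."""
    return width * height * in_channels * num_filters * kernel * kernel

# Hypothetical layout: (feature-map width, input channels, number of filters).
layers = [
    (32, 3, 64), (32, 64, 64),       # block 1, followed by 2x2 max-pooling
    (16, 64, 128), (16, 128, 128),   # block 2, followed by 2x2 max-pooling
    (8, 128, 192), (8, 192, 192),    # block 3
]

macs = [conv_macs(w, w, c_in, c_out) for (w, c_in, c_out) in layers]
total = sum(macs)
for i, m in enumerate(macs, start=1):
    print(f"conv{i}: {m:>10,d} MACs ({100.0 * m / total:.1f}% of conv total)")
```

Under these assumptions the cost grows with the number of filters and the channel count but shrinks by a factor of four each time the feature-map width is halved, which is why the per-layer distribution is not monotone in depth.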

To design C2F Nets, we want to construct relatively simpler networks from the base network for different trade-offs between accuracy and energy consumption. For these simpler (coarser) networks, we want to reduce the energy consumption by varying the values of the above-mentioned three key elements. We fix the size of the filter to 3x3 across all the layers for all the base networks. Therefore, our design of coarser networks is guided by reducing the number of convolution layers, the number of filters per layer, and the input image dimension. Figure 5 shows the distribution of energy consumption (normalized with respect to the total energy consumed by the base network) across the different convolution layers of base network A. By progressively combining different numbers of convolution layers, we can thus generate coarser networks of progressively increasing energy consumption.

Energy analysis of fully connected layers. Fully connected or dense layers consume the most energy after convolution layers. Dense layers are part of the classifier block of our base DNNs. The classifier block generally has two or more dense layers. Of these, the first dense layer is usually the largest, because its input is obtained by flattening the features generated by the immediately preceding convolution layer (after just one max-pooling layer), and its size is proportional to the square of the image width. Hence, the first FC layer consumes significantly more energy than the rest.

V-C Designing C2F Networks

Our goal is to construct networks of varying complexity that show different trade-offs in accuracy and energy to be part of C2F Net.

Convolution layer. From the above discussion, we can construct networks that consume different amounts of energy by combining different numbers of convolution layers (which act as the feature transformation block for each level $i$) followed by a classifier for that level. For example, we construct a C2F Net for base network A with three levels as follows. Level 1 (coarsest network) consists of two convolution layers as the feature block followed by a classifier block. Level 2 consists of four convolution layers (two extra convolution layers in addition to those from level 1). Level 3 corresponds to the full base network (finest network). Therefore, with increasing levels, we have progressively more convolution layers, resulting in better accuracy at the expense of energy.

Fully connected layer. While designing C2F networks, we add classifiers at different depths of the convolution stack, one for each level. At small depth, the image width seen by the convolution is large (due to fewer max-pooling layers). Hence, a large number of features is fed to the classifier block. This increases the energy consumption of the fully connected layers in the classifier blocks of coarser levels. Figure 6 shows the energy consumption across the different levels of the C2F Net corresponding to base network A, as described above, without an extra max-pooling layer.

We observe that the energy consumed by the FC layers of level 1 and level 2 is significantly larger than the energy consumed by the FC layer of the finest network (level 3), which is not desirable. Therefore, to match the classifier block energy across different levels, we add an extra max-pooling layer to the classifier block of each coarser network (level 1 and level 2) such that the FC input dimension and the corresponding energy are similar across all the levels. Figure 6 shows that the energy consumption with the extra max-pooling layer is comparable across all levels.
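The effect of the extra max-pooling layer on the FC input dimension can be checked with simple arithmetic. The sketch below assumes a 32x32 input, 2x2 max-pooling, and a hypothetical layout in which levels 1, 2, and 3 end after one, two, and three pooling steps with 64, 128, and 192 channels respectively; these numbers are illustrative and are not taken from the paper's figures.

```python
def fc_input_features(image_width, num_pools, channels, pool=2):
    """Flattened feature count fed to the first FC layer of a classifier block."""
    width = image_width // (pool ** num_pools)
    return width * width * channels

# level -> (pooling steps before the classifier, output channels); assumed layout.
levels = {1: (1, 64), 2: (2, 128), 3: (3, 192)}

without_extra = {lvl: fc_input_features(32, pools, ch)
                 for lvl, (pools, ch) in levels.items()}

# Coarser levels (1 and 2) get one extra max-pooling layer; level 3 is unchanged.
with_extra = {lvl: fc_input_features(32, pools + (1 if lvl < 3 else 0), ch)
              for lvl, (pools, ch) in levels.items()}

print(without_extra)  # {1: 16384, 2: 8192, 3: 3072}
print(with_extra)     # {1: 4096, 2: 2048, 3: 3072}
```

Each extra 2x2 pooling step divides the flattened feature count by four, which brings the coarser classifiers into the same range as the finest one under this layout.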

In summary, a practitioner can employ the above design principles to design coarse-to-fine networks that provide different trade-offs between accuracy and energy consumption.

VI Experiments and Results

In this section, we first describe the experimental setup and then discuss results along different dimensions of optimized C2F Nets.

VI-A Experimental Setup

Hardware Setup. We employ the hardware platform described in Section V-A to perform inference using different networks and to measure the accuracy and energy consumption of the inference process.

Image classification task and Training data. We consider image classification tasks. We employ the CIFAR10, CIFAR100, and MNIST [35] [36] image datasets with a 4:1:1 split for training (40000), validation (10000), and testing (10000) to train and test our C2F Nets approach. The input image dimension is 32x32x3 for the CIFAR10 and CIFAR100 datasets, while it is 32x32x1 for the MNIST dataset. The number of classes to be predicted is 10 for CIFAR10 and MNIST, while it is 100 for CIFAR100. We employed the CIFAR10 dataset to train our adaptive C2F Nets A and B, the MNIST dataset for C2F Net C, and the CIFAR100 dataset for C2F Net D.

Energy and accuracy trade-off objective $\mathcal{O}$. Recall that the training of C2F Nets is based on the objective $\mathcal{O}(\mathcal{NN},\mathcal{H},\lambda)=\mathbb{E}_{(x,y^{*})\sim\mathcal{D}}\left[\lambda\cdot\mathrm{Error}(\mathcal{NN},\mathcal{H})+(1-\lambda)\cdot\mathrm{Energy}(\mathcal{NN},\mathcal{H})\right]$, where $\mathcal{H}$ is the hardware platform and $\mathcal{NN}$ is the C2F Net. Since accuracy and error are complementary, we present all our results in terms of accuracy and energy for ease of exposition. Accuracy is measured as the fraction of input images whose labels are predicted correctly. Energy consumption of C2F Nets is measured in terms of the normalized energy delay product (EDP): $\mathrm{EDP}=\left(\sum P\,\Delta t\right)\cdot T$, where $\Delta t$ is the time interval at which we record the power $P$ and $T$ is the total execution time. As power is measured at a regular interval, the energy term equals $P_{avg}\cdot T$, so we simply calculate EDP as $\mathrm{EDP}=P_{avg}\cdot T^{2}$. This value is normalized by the EDP of the base network. We vary the value of $\lambda$ from 0 to 1 to get C2F Nets optimized for different trade-offs between accuracy and energy.
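The EDP computation and the trade-off objective can be written out as a short sketch. The power samples, the sampling interval, and the function names below are hypothetical values chosen for illustration; they are not measurements from the SmartPower2 setup.

```python
import math

def edp_from_samples(power_watts, dt):
    """EDP = (sum of P * dt) * T, with T the total execution time."""
    energy = sum(p * dt for p in power_watts)
    total_time = dt * len(power_watts)
    return energy * total_time

def edp_from_average(power_watts, dt):
    """Equivalent form for regularly sampled power: EDP = P_avg * T^2."""
    total_time = dt * len(power_watts)
    p_avg = sum(power_watts) / len(power_watts)
    return p_avg * total_time ** 2

def trade_off(error, normalized_energy, lam):
    """Per-configuration objective: lam * Error + (1 - lam) * Energy."""
    return lam * error + (1.0 - lam) * normalized_energy

# Hypothetical traces sampled every 0.1 s (values in watts).
base_trace = [3.0, 4.0, 4.0, 3.0, 4.0, 3.0]   # base (finest) network
c2f_trace = [3.0, 4.0, 3.0, 2.0]              # adaptive C2F Net

edp_base = edp_from_samples(base_trace, dt=0.1)
edp_c2f = edp_from_samples(c2f_trace, dt=0.1)
normalized_edp = edp_c2f / edp_base            # below 1 means an EDP improvement
```

Both forms give the same value because, for a regular sampling interval, the summed energy is exactly the average power times the total time.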

C2F Networks. We employ four C2F networks in our experimental evaluation. Figure 7 shows the high-level architecture of C2F Nets A and C with three levels and C2F Nets B and D with four levels using the notations introduced in Figure 7. With Net A and Net B, we demonstrate the effect of different architectures on the same CIFAR10 dataset. With Net C and Net D, we analyze the impact on performance for a simpler dataset (MNIST) and a more complex dataset (CIFAR100), respectively.

Confidence threshold types. We experimented with all three types of confidence computation: Max probability, Margin, and Entropy. We observed that the performance is almost the same for all three types. Therefore, we present all our results using Max probability as the confidence type.
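As a rough illustration of how such confidence scores drive adaptive inference, the sketch below uses common textbook definitions of the three measures and a simple level-by-level threshold test. The exact formulas, the function names, and the treatment of each level as an independent callable are assumptions made here for brevity; in a C2F Net the levels share convolution layers, so features computed at a coarser level would be reused rather than recomputed.

```python
import numpy as np

def max_probability(probs):
    """Confidence = largest class probability."""
    return float(np.max(probs))

def margin(probs):
    """Confidence = gap between the two largest class probabilities."""
    top_two = np.sort(probs)[-2:]
    return float(top_two[1] - top_two[0])

def entropy_confidence(probs):
    """Confidence = 1 - normalized entropy (1 for one-hot, 0 for uniform)."""
    p = np.clip(probs, 1e-12, 1.0)
    return float(1.0 + np.sum(p * np.log(p)) / np.log(len(p)))

def cascade_predict(x, classifiers, thresholds, confidence=max_probability):
    """Return (label, level): stop at the first level that is confident enough.

    `classifiers` is ordered coarsest to finest; `thresholds` has one entry per
    level except the last, which always makes a prediction.
    """
    for level, (clf, thr) in enumerate(zip(classifiers[:-1], thresholds), start=1):
        probs = clf(x)
        if confidence(probs) >= thr:
            return int(np.argmax(probs)), level
    probs = classifiers[-1](x)
    return int(np.argmax(probs)), len(classifiers)

# Hypothetical two-level example with fixed soft-max outputs.
coarse = lambda x: np.array([0.55, 0.30, 0.15])
fine = lambda x: np.array([0.05, 0.90, 0.05])

easy = cascade_predict(None, [coarse, fine], thresholds=[0.5])   # exits at level 1
hard = cascade_predict(None, [coarse, fine], thresholds=[0.8])   # falls through to level 2
```

Using one threshold per level, rather than a single shared value, is what gives the optimizer a separate knob for each exit point.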

Offline training of C2F Nets. Recall that we need to find the parameters $\alpha$, $\beta$, and $\gamma$ for the specified trade-off objective using the training and validation sets of classification examples. For obtaining the parameters $\alpha$ and $\beta$, we employ the backpropagation training algorithm using the RMSprop optimizer with a learning rate of 0.0001 and a decay of $10^{-6}$. We employ a batch size of 128 and run training for 200 epochs with data augmentation including horizontal flips and width and height shifts. For obtaining the confidence thresholds, we employ the Bayesian Optimization (BO) approach with a Gaussian kernel and the UCB rule as the acquisition function. We initialize BO with five evaluations of the objective $\mathcal{O}$ at randomly chosen inputs. We run BO until convergence or for a maximum of 100 iterations. We train C2F Nets for different values of $\lambda$ to get different trade-offs between energy and accuracy.

VI-B Results and Discussion

In this section, we present the results of our optimized C2F Nets and compare them along different dimensions.

Accuracy and energy of networks at different levels. We train and test the networks at different levels that are part of the C2F Net. Specifically, we make all the predictions using a single network for each level of the C2F Net. This experiment characterizes the accuracy and energy trade-off achieved by the different networks (coarsest to finest) that are part of the C2F Net. Figure 8 shows the accuracy and energy metrics for networks at different levels for all the C2F Nets A, B, C, and D. C2F Net A: From level 3 to level 2, accuracy drops by 5% and EDP reduces by 50%. Similarly, from level 2 to level 1, accuracy drops by 15% and EDP reduces by 30% w.r.t. level 3. C2F Net B: We achieve an accuracy of 92% when we train and test the base network architecture B (i.e., level 3 network) on CIFAR10 data. We see a 7% drop in accuracy and an 85% gain in EDP when we move from level 3 to level 2. C2F Net C and D: The trends are similar to those of Net A and Net B, respectively. However, we observe higher accuracies with Net C because the MNIST dataset is simple, and lower accuracies with Net D because the CIFAR100 dataset is relatively complex.

In summary, we see a significant gain in energy with a relatively small loss in accuracy when we move from the finest to the coarsest network in C2F Nets. This corroborates our hypothesis that we can save a significant amount of energy if we are able to select coarser networks for the large fraction of images that are easy and complex networks for the relatively small fraction of images that are hard.

Optimized C2F Net for accuracy of different networks. We saw that fixed networks take significantly more energy for small improvement in accuracy. On the other hand, optimized C2F Nets can potentially reduce the energy consumption to achieve the same amount of accuracy by performing input-specific adaptive inference. To test the effectiveness of adaptive C2F Nets, we trained C2F nets with different $\lambda$ (trade-off) values to find the configurations that achieve the same accuracies as networks at different levels (Figure 9).

C2F Net A: Adaptive C2F Net improves the EDP by 26% and 20% to achieve the same accuracies when compared to the base networks at level 2 (89.8% accuracy) and level 1 (85% accuracy).

C2F Net B: We see significantly more improvement in energy gains when compared to C2F Net A. For example, EDP improves by 60% to achieve 92% accuracy of level 3 (most complex network) and improves by 26% for achieving 85% accuracy of level 2 network.

C2F Net C and D: Similar to the results for Net A and Net B, we observe that EDP improves by 46% and 51% respectively with respect to the base network.

In summary, our approach using adaptive C2F Nets can significantly reduce the energy consumption with negligible loss in accuracy when compared to base networks of different complexity that are part of C2F Net. Additionally, the energy gains increase significantly for complicated base networks (e.g. Net B and D).

Prediction time of adaptive C2F Net vs. finest network. We compare the average execution time to make predictions using adaptive C2F Nets and the finest (most complex) network, where C2F Nets are optimized to achieve the same accuracy as the finest network. Figure 12 shows these results. For network A, the average prediction times of the base network and the adaptive C2F Net are $\sim$ 0.27 secs and $\sim$ 0.20 secs respectively, i.e., a 26% reduction in prediction time. Similarly, for network B, the prediction time reduces from $\sim$ 0.82 secs to $\sim$ 0.40 secs, leading to a 52% reduction. In summary, our adaptive C2F networks need less prediction time to achieve the same accuracy as the base network. As with the energy gains, the improvement is larger for more complex networks.

Fine-grained analysis of adaptive C2F Nets. The results presented above demonstrate that adaptive C2F Nets can improve both energy and prediction time when compared to the finest network. We perform a fine-grained analysis to understand how adaptive C2F Nets achieve these gains. We present this analysis for C2F Net A, noting that the analysis for other C2F Nets shows similar trends.

C2F Net A was optimized for achieving the same accuracy (89.8%) as the finest network. The energy gains can be explained by understanding how many images out of the 10000-image testing set are classified at different levels. At level 1, 263 images are classified with 100% accuracy. At level 2, 4911 images are classified with 98.7% accuracy. At level 3, 4826 images are classified with 80.2% accuracy. Therefore, about 2.6%, 49.1%, and 48.3% of the total testing images are classified at levels 1, 2, and 3 respectively. From previous results, we see that the accuracy of 89.8% is obtained using 76% of the energy consumed by the finest network. Levels 1, 2, and 3 contribute 0.46%, 24.5% and 48.26% to this 76% respectively. Similarly, for C2F Net B with four levels, we see 0, 1339, 5517, and 3144 images classified with accuracies of 0, 99.7%, 97.17% and 79.73% and energy consumption of 0.0, 0.49%, 7.7% and 31.4% respectively. In both cases, more than 50% of the images were predicted by a level other than the last level. Additionally, the accuracy of the lower levels is close to 100%. Therefore, we see large improvements in energy and prediction time for adaptive C2F Nets.
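As a quick consistency check, the overall accuracy and the share of images that exit before the last level can be recomputed from the per-level counts and accuracies quoted above. The helper names are hypothetical; the numbers are the ones reported in the text.

```python
def overall_accuracy(counts, accuracies):
    """Image-weighted accuracy over all levels of a C2F Net."""
    correct = sum(n * a for n, a in zip(counts, accuracies))
    return correct / sum(counts)

def early_exit_fraction(counts):
    """Fraction of images predicted by any level other than the last one."""
    return sum(counts[:-1]) / sum(counts)

# C2F Net A: levels 1, 2, 3.
counts_a = [263, 4911, 4826]
acc_a = overall_accuracy(counts_a, [1.0, 0.987, 0.802])

# C2F Net B: levels 1, 2, 3, 4 (no image is predicted at level 1).
counts_b = [0, 1339, 5517, 3144]
acc_b = overall_accuracy(counts_b, [0.0, 0.997, 0.9717, 0.7973])

print(f"Net A: accuracy {acc_a:.3f}, early exits {early_exit_fraction(counts_a):.1%}")
print(f"Net B: accuracy {acc_b:.3f}, early exits {early_exit_fraction(counts_b):.1%}")
```

The weighted accuracies come out at roughly 89.8% for Net A and 92.0% for Net B, matching the accuracies of the corresponding finest networks, and in both cases more than half of the test images exit before the last level.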

Qualitative results. Figure 10 shows some sample images that are predicted at level 1 (coarsest) and level 3 (finest) using adaptive C2F Net A. We make the following observations: 1) Images classified using level 1 are simpler, with a single object and a clear background; 2) Images classified using level 3 have overlapping or hidden objects with confusing backgrounds. These results demonstrate how adaptive C2F Nets use simpler classifiers for easy images and complex classifiers for hard images.

Single threshold vs. Multiple threshold. To measure the quantitative difference between using a single threshold for all levels of a C2F Net (as in CDL [5]) and using a different threshold for each level, we compare their Pareto-optimal curves. Figure 13 shows the comparative results for adaptive C2F networks B and D. We make the following observation. When the number of levels is larger, different thresholds for different levels achieve better Pareto-optimal solutions than a single threshold for all levels. Our experimental results on CIFAR10 (Net B) and CIFAR100 (Net D) clearly show this difference. We also present the Pareto-optimal solutions we obtained for all adaptive C2F Nets A, B, C, and D in Figure 12 for completeness.

Behavior of optimized C2F Nets with different $\lambda$. We can vary the trade-off parameter to obtain optimized C2F Nets for different accuracy and energy trade-offs. When $\lambda$ is close to zero, there is more emphasis on saving energy without paying attention to accuracy. Similarly, when $\lambda$ is close to one, adaptive C2F Nets try to achieve the best possible accuracy while keeping the overall energy consumption as low as possible.

We study the behavior of optimized C2F nets with different values of $\lambda$ in terms of the number of images predicted at different levels, the prediction accuracy at different levels, and the average prediction time. Figure 14 shows these results for C2F Net A noting that results are similar for other C2F nets.

We make the following observations from Figure 14(a). When $\lambda$ =0, all the images are predicted at level 1, as the coarsest network consumes the least energy. As the $\lambda$ value increases, the number of images predicted by level 2 and level 3 slowly increases to achieve higher accuracy. Note that, even with the highest $\lambda$ value, not all images are predicted by level 3 (finest network), because the energy consumption is still optimized while achieving high accuracy.

From Figure 14(b), we can see that as $\lambda$ increases, the prediction accuracies of level 1 and level 2 increase and reach almost 100%. The overall accuracy of the C2F Net is then primarily determined by the accuracy of the level 3 classifier.

From Figure 14(c), we observe that the average prediction time gradually increases with $\lambda$, which is explained by more and more images getting predicted at higher levels. When compared to the base (finest) network, the prediction time at $\lambda$ =0 is 63% lower, and at the highest $\lambda$ value we get around a 26% gain. Therefore, in scenarios where we require real-time predictions, we can optimize the accuracy of C2F Nets for the target time constraints.

VII Conclusions and Future Work

Motivated by the challenges associated with performing inference using deep neural networks on resource-constrained embedded systems, we studied a co-design approach based on the formalism of coarse-to-fine networks (C2F Nets) that allows us to employ classifiers of varying complexity depending on the hardness of input examples. Our proposed optimization algorithm automatically configures a C2F Net for a specified trade-off between accuracy and energy consumption on a target hardware platform, and our experiments show that it is very effective. Results on four different C2F Nets for image classification show that, using an optimized C2F Net, we can significantly reduce the overall energy with no loss in accuracy when compared to a baseline solution in which predictions for all input examples are made using the most complex network. Though we have demonstrated the method for four nets, this idea could be extended to other architectures like MobileNets [18]. Future directions include evaluation of our framework on deep networks such as MobileNets [18], studying automated approaches for C2F Nets design, heterogeneous architectures for embedded systems in the context of inference using deep networks, machine learning based approaches for software and hardware co-design using C2F Nets, and dynamic power management to further improve the overall energy-efficiency.

Bibliography

[1] Y. LeCun, Y. Bengio, and G. E. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[2] K. A. et al., “Deep reinforcement learning: A brief survey,” IEEE Signal Process. Mag., vol. 34, no. 6, pp. 26–38, 2017.
[3] D. S. et al., “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[4] F. Fleuret and D. Geman, “Coarse-to-fine face detection,” in International Journal of Computer Vision, vol. 41, no. 1/2, 2001, pp. 85–107.
[5] P. P. et al., “Conditional deep learning for energy-efficient and enhanced pattern recognition,” in Proceedings of DATE, 2016.
[6] V. S. et al., “Efficient processing of deep neural networks: A tutorial and survey,” Proceedings of the IEEE, vol. 105, no. 12, pp. 2295–2329, 2017.
[7] Y. Chen, J. S. Emer, and V. Sze, “Using dataflow to optimize energy efficiency of deep neural network accelerators,” IEEE Micro, 2017.
[8] M. G. et al., “Tetris: Scalable and efficient neural network acceleration with 3d memory,” SIGARCH Computer Architecture News, 2017.