Energy Efficient Hardware for On-Device CNN Inference via Transfer Learning
Paul Whatmough, Chuteng Zhou, Patrick Hansen, Matthew Mattina

TL;DR
This paper introduces FixyNN, a hardware platform that combines fixed-weight feature extractors with programmable classifiers, significantly improving energy efficiency for on-device CNN inference through transfer learning.
Contribution
The paper presents a novel co-designed hardware and transfer learning approach that splits CNNs into fixed front-end and adaptable back-end layers for energy-efficient inference.
Findings
Nearly 2x energy efficiency improvement over conventional accelerators
Maintains <1% accuracy loss across six datasets
Effective transfer learning with fixed front-end layers
Table 1: Transfer learning accuracy (%) per dataset, throughput, and energy efficiency for MobileNet-0.25 on FixyNN as more layers are fixed.
| Fixed layers | Fixed ops (%) | CIFAR100 (%) | CIFAR10 (%) | SVHN (%) | Flwr (%) | Airc (%) | GTSR (%) | Throughput (TOPS) | Throughput (relative) | Energy eff. (TOPS/W) | Energy eff. (relative) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 72.8 | 93.5 | 95.8 | 88.1 | 67.7 | 97.7 | 1.91 | 1.00 | 5.58 | 1.00 |
| 4 | 27.1 | 72.5 | 93.3 | 95.7 | 88.3 | 66.7 | 97.8 | 2.32 | 1.21 | 7.98 | 1.43 |
| 7 | 44.3 | 72.0 | 92.7 | 95.8 | 87.5 | 64.0 | 95.0 | 2.62 | 1.37 | 10.78 | 1.93 |
| 11 | 77.0 | 71.1 | 91.7 | 94.6 | 86.9 | 56.7 | 89.2 | 3.18 | 1.66 | 25.86 | 4.63 |
| 14 | 97.0 | 68.5 | 85.3 | 91.0 | 82.8 | 41.9 | 59.3 | - | - | - | - |
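The relative throughput and energy efficiency columns of the table can be reproduced by dividing each row by the fully programmable baseline (0 fixed layers). A minimal Python sketch of that arithmetic follows; the values are copied from the table, and the variable names are chosen here for illustration only.

```python
# Absolute throughput (TOPS) and energy efficiency (TOPS/W) from the table,
# keyed by the number of fixed layers. The 14-layer row reports no hardware
# figures, so it is omitted here.
hardware = {
    0: (1.91, 5.58),
    4: (2.32, 7.98),
    7: (2.62, 10.78),
    11: (3.18, 25.86),
}

base_tops, base_tops_per_watt = hardware[0]

# Normalize every configuration to the baseline and round to two decimals,
# matching the precision used in the table.
relative = {
    layers: (round(tops / base_tops, 2), round(tpw / base_tops_per_watt, 2))
    for layers, (tops, tpw) in hardware.items()
}

for layers, (rel_tops, rel_eff) in relative.items():
    print(f"{layers:>2} fixed layers: {rel_tops:.2f}x throughput, "
          f"{rel_eff:.2f}x energy efficiency")
```

The 7-layer row gives the 1.37 and 1.93 ratios that the paper quotes as its headline result.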
Energy Efficient Hardware for On-Device CNN Inference via Transfer Learning
Paul Whatmough Chuteng Zhou Patrick Hansen Matthew Mattina
Arm Research
Boston, MA
{Paul.Whatmough,Chu.Zhou,Patrick.Hansen,Matthew.Mattina}@arm.com
Abstract
On-device CNN inference for real-time computer vision applications can result in computational demands that far exceed the energy budgets of mobile devices. This paper proposes FixyNN, a co-designed hardware accelerator platform which splits a CNN model into two parts: a set of layers that are fixed in the hardware platform as a front-end fixed-weight feature extractor, and the remaining layers which become a back-end classifier running on a conventional programmable CNN accelerator. The common front-end provides ubiquitous CNN features for all FixyNN models, while the back-end is programmable and specific to a given dataset. Image classification models for FixyNN are trained end-to-end via transfer learning, with front-end layers fixed for the shared feature extractor, and back-end layers fine-tuned for a specific task. Over a suite of six datasets, we trained models via transfer learning with an accuracy loss of <1%, resulting in a FixyNN hardware platform with nearly 2× better energy efficiency than a conventional programmable CNN accelerator of the same silicon area (i.e. hardware cost).
1 Introduction
Emerging applications such as augmented/mixed reality, autonomous drones and automotive driver assistance demand on-device computer vision (CV) features, such as image classification, object detection/tracking, and semantic segmentation. In support of these applications, we’ve seen a marked increase in accuracy on such CV tasks in recent years, following the displacement of traditional hand-crafted feature extractors by convolutional neural networks (CNNs) [1]. However, CNNs pose a number of challenges for on-device inference due to a vast increase in the amount of computation and storage required [1], which must be met by the hardware platform. Unfortunately, mobile device hardware resources are heavily constrained in terms of both energy consumption and also the silicon area of the system-on-chip (SoC) inside the device. Therefore, a gap in energy efficiency exists between the demands of real-time CV CNNs, and the power constraints of mobile devices. This gap is severely compounded at high image resolution and frame rate (e.g. 1080p at 30FPS).
In this paper we describe FixyNN, which builds upon two key trends in on-device ML: more compact CNN architectures [2] and energy efficient CNN hardware accelerators [3, 4]. Section 2 gives an overview of the FixyNN architecture, Section 3 presents experimental results, and Section 4 provides concluding remarks.
2 FixyNN Overview
FixyNN is a CNN model architecture co-designed with the hardware platform and trained using transfer learning principles. The general approach illustrated in Figure 1 is to divide a given model into a shared front-end fixed-weight feature extractor (FFE) and a task-specific back-end classifier. The FFE implements a fixed set of CNN layers that are common to all models, and is implemented as a heavily-optimized fixed-function hardware accelerator – essentially an embodiment of “do one thing and do it well”. The FFE is fixed in hardware and used for all FixyNN models, and therefore the layers used are taken from a model trained on a large dataset such as ImageNet, to learn features that generalize well across a range of datasets. The weights for the FFE are fixed in the hardware and do not require access to DRAM. The back-end classifier is unique for each model, and is therefore implemented on a canonical programmable CNN hardware accelerator [4, 3], or could even be implemented using the mobile CPU or GPU. The weights for the back-end classifier are stored in DRAM. (The terms feature extractor and classifier are used very loosely here, merely to distinguish between the layers towards the front of the network, and the remaining layers up to the end.)
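The split described above can be pictured as one shared, immutable function feeding several swappable task-specific ones. The following is a minimal pure-Python sketch, not the FixyNN implementation: the toy dense "layers", the weight values, and the task names are all hypothetical, and a real front-end and back-end would consist of CNN layers.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]


def relu(values: Sequence[float]) -> Vector:
    return [max(0.0, v) for v in values]


def dense(weights: Sequence[Sequence[float]], inputs: Sequence[float]) -> Vector:
    """Multiply a weight matrix (one row per output) by an input vector."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


# Shared front-end: the weights are an immutable tuple, mirroring the fact
# that FFE weights are fixed when the hardware is designed (values made up).
FFE_WEIGHTS: Tuple[Tuple[float, ...], ...] = (
    (0.5, 0.0, -1.0),
    (0.0, 2.0, 0.0),
)


def fixed_feature_extractor(image: Sequence[float]) -> Vector:
    return relu(dense(FFE_WEIGHTS, image))


def make_backend(weights: List[List[float]]) -> Callable[[Sequence[float]], Vector]:
    """Build a task-specific back-end; its weights differ per dataset."""
    def backend(features: Sequence[float]) -> Vector:
        return dense(weights, features)
    return backend


# Hypothetical per-task back-ends that all consume the same FFE features.
backends: Dict[str, Callable[[Sequence[float]], Vector]] = {
    "task_a": make_backend([[1.0, 0.0], [0.0, 1.0]]),
    "task_b": make_backend([[1.0, 1.0]]),
}


def classify(task: str, image: Sequence[float]) -> Vector:
    features = fixed_feature_extractor(image)  # same front-end for every task
    return backends[task](features)


outputs = {task: classify(task, [2.0, 1.0, 0.5]) for task in backends}
print(outputs)
```

Only the `backends` table changes when a new dataset is added; the front-end function and its weights stay untouched, which is the property that lets the FFE be hard-wired.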
In Section 3, we demonstrate significant throughput and energy efficiency advantages from the FixyNN hardware platform. These gains are a result of diverting a significant portion of the computational load of a given CNN to the heavily-specialized FFE hardware accelerator. The FFE can be heavily optimized because all the front-end CNN layers associated with the FFE are known and fixed at the time of designing the hardware platform. Hence, we can aggressively exploit unstructured weight sparsity and other optimizations which typically offer little advantage to programmable hardware. Usually, this kind of aggressive hardware specialization has limited utility, as the FFE is essentially a fixed-function accelerator and only implements a set of fixed-weight layers. However, in the context of deep CNNs, it is well known that through transfer learning [5], we are typically able to train new models that incorporate a set of front-end layers from a model trained on a different dataset. Therefore, FixyNN explores an opportunity to aggressively exploit hardware specialization, without losing the ability to generalize to a range of CV tasks.
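To illustrate why weights that are known in advance make unstructured sparsity cheap to exploit, the sketch below specializes a dot product at "design time": zero weights are dropped once, so the specialized function never touches them. This is a Python analogy under assumed integer weights, not the FFE hardware itself.

```python
from typing import Callable, List, Sequence, Tuple


def specialize_dot(weights: Sequence[int]) -> Tuple[Callable[[Sequence[int]], int], int]:
    """Return a dot-product function specialized to fixed weights.

    Zero weights are removed up front, mimicking how fixed-function hardware
    can omit the multipliers for them entirely. Also returns the number of
    multiply-accumulate terms that remain.
    """
    nonzero: List[Tuple[int, int]] = [(i, w) for i, w in enumerate(weights) if w != 0]

    def dot(inputs: Sequence[int]) -> int:
        if len(inputs) != len(weights):
            raise ValueError("input length must match the number of weights")
        return sum(w * inputs[i] for i, w in nonzero)

    return dot, len(nonzero)


# Hypothetical quantized weights with irregular (unstructured) zeros.
fixed_weights = [3, 0, 0, -2, 0, 1, 0, 0]
dot, mac_count = specialize_dot(fixed_weights)

activations = [1, 5, 7, 2, 9, 4, 8, 6]
result = dot(activations)
print(f"{mac_count} of {len(fixed_weights)} multiplies kept, result = {result}")
```

A programmable accelerator receives its weights at run time, so it cannot hard-wire which positions are zero in this way, which is consistent with the observation above that such optimizations offer it little advantage.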
3 Experimental Results
Focusing on image classification problems, we evaluate FixyNN using experiments based on the MobileNetV1 [2] CNN architecture, and focus on the compact MobileNet-0.25, which is well-suited to on-device inference applications. We evaluate both the hardware throughput and energy efficiency, as well as the model accuracy across a range of image classification datasets.
3.1 Hardware Evaluation
To implement the FFE hardware accelerator, we designed a tool called DeepFreeze [6], which consumes a model described in a high-level framework such as TensorFlow, and generates a hardware description in the Verilog language. DeepFreeze generates fully-parallel, fully-pipelined hardware, with optimizations for exploiting unstructured weight sparsity, and fine-grained quantization, as well as optimized storage which does not require DRAM access. Since there are a variety of possible hardware configurations for FixyNN, with varying silicon area costs, we model the throughput and energy efficiency for these configurations. Throughput is reported in Tera-Operations Per Second (TOPS), and energy efficiency in TOPS/W. The FFE running the front-end model is modeled using hardware synthesis from a register-transfer level description, from which we can simulate clock frequency and power consumption. The programmable back-end is modeled using previously published NVDLA data. In all cases, we compare FixyNN with a baseline of a fully-programmable model running on NVDLA, which is the current state-of-the-art for mobile devices.
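Because energy efficiency is reported as throughput per watt, the two metrics together imply a power figure for each configuration. The short Python sketch below derives it from the Table 1 values. It assumes that both columns describe the same operating point, which the text does not state explicitly, so the results are indicative only.

```python
# (TOPS, TOPS/W) per number of fixed layers, copied from Table 1.
configs = {0: (1.91, 5.58), 4: (2.32, 7.98), 7: (2.62, 10.78), 11: (3.18, 25.86)}


def implied_power_watts(tops: float, tops_per_watt: float) -> float:
    """Power = throughput / (throughput per watt)."""
    return tops / tops_per_watt


implied = {layers: implied_power_watts(*pair) for layers, pair in configs.items()}

for layers, watts in implied.items():
    print(f"{layers:>2} fixed layers: ~{watts * 1000:.0f} mW implied")
```

Under that assumption, the implied power falls as more layers are fixed even though throughput rises, which is why the energy efficiency ratio grows faster than the throughput ratio.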
Figure 2 illustrates the throughput and energy efficiency trade-offs of fixing an increasing number of layers of a network in hardware at different hardware silicon area costs. The results suggest that at areas greater than 2 mm², it is beneficial in both performance and energy efficiency to invest area to fix some layers of a network with the FFE, rather than dedicating that area to a larger programmable accelerator (NVDLA). However, it is inefficient to invest in an FFE at very low area budgets because even a small number of fixed layers consume a high percentage of the total area budget, causing a smaller programmable accelerator to bottleneck the system.
3.2 ML Evaluation
FixyNN utilizes transfer learning concepts to train models for various datasets that take advantage of the single, shared FFE front-end. Similar to [5], each model is pretrained on ImageNet and the back-end layers are fine-tuned on each target dataset. However, we do enable training of batch norm (BN) parameters across all layers (i.e. including the layers in the FFE), which we found to dramatically improve performance with a negligible area and power increase in the FFE. Table 1 summarizes the accuracies for our transfer learning experiments with MobileNet-0.25, where the model is pretrained on ImageNet and then transferred to CIFAR-10, CIFAR-100, Street View House Numbers (SVHN), Flowers102 (Flwr), FGVC-Aircraft (Airc), and German Traffic Sign Recognition benchmark (GTSR). The first row (our baseline) shows the performance of the model fully fine-tuned to the new tasks. As more layers are fixed in the network, a bigger FFE is used. At area budgets greater than 2 mm² (Section 3.1), hardware performance can significantly benefit from the bigger FFE. However, deeper layers are associated with more task-specific features, so fixing more layers in the FFE generally results in a loss of model accuracy. For CIFAR-10, CIFAR-100, SVHN and Flwr, Table 1 shows that 77.0% of the network's operations (11 layers) can be fixed with less than 2% loss in model accuracy, and 44.3% (7 layers) with less than 1%. For Airc and GTSR, similar accuracy relative to the baseline requires fixing a smaller percentage of the network in the FFE (between 27.1% and 44.3%).
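The training rule described above (front-end convolution weights frozen, batch norm parameters trainable in every layer, back-end fully trainable) can be expressed as a per-parameter mask. The sketch below is framework-agnostic Python; the parameter naming scheme (`layer<index>/<kind>/<param>`) is a hypothetical convention chosen for illustration and is not taken from the paper.

```python
from typing import Dict, Iterable


def trainable_mask(param_names: Iterable[str], num_fixed_layers: int) -> Dict[str, bool]:
    """Decide which parameters are updated during fine-tuning.

    Assumes names look like 'layer<index>/<kind>/<param>' with kind in
    {'conv', 'bn', 'fc'} (a hypothetical convention).
    """
    mask: Dict[str, bool] = {}
    for name in param_names:
        layer, kind, _ = name.split("/")
        index = int(layer[len("layer"):])
        in_front_end = index < num_fixed_layers
        # Front-end: only batch norm parameters train.
        # Back-end: every parameter trains.
        mask[name] = (kind == "bn") if in_front_end else True
    return mask


names = [
    "layer0/conv/kernel", "layer0/bn/scale", "layer0/bn/offset",
    "layer1/conv/kernel", "layer1/bn/scale", "layer1/bn/offset",
    "layer2/conv/kernel", "layer2/bn/scale", "layer2/bn/offset",
    "layer3/fc/kernel",
]

mask = trainable_mask(names, num_fixed_layers=2)
frozen = [name for name, trainable in mask.items() if not trainable]
print("frozen:", frozen)
```

In a training framework, such a mask would typically be applied by excluding the frozen parameters from the optimizer or by zeroing their gradients.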
3.3 Discussion
Our experiments suggest that dedicating some percentage of the hardware platform to a fixed-weight feature extractor provides significant performance and power benefits when the area devoted to CNN vision tasks is greater than 2 mm². Fixing more layers of a network provides better hardware performance by diverting computational load from the programmable accelerator to the highly-efficient FFE. However, as more layers are fixed, the task of training a new network incorporating the FFE on a different dataset becomes more challenging, and tends to incur an accuracy loss. Concretely, by fixing 7 layers of MobileNet-0.25 in hardware, FixyNN achieves 1.37× and 1.93× better performance and energy efficiency, respectively, compared to a traditional programmable accelerator, with the per-dataset accuracy shown in Table 1. Since two tasks (Airc, GTSR) incur a potentially unacceptable accuracy loss with 7 fixed layers, we propose to modify FixyNN to provide access to output activations from an earlier layer in the FFE, such that a model does not have to use all the fixed layers in the FFE. With this modification, models for Airc and GTSR can use just 4 layers of the 7-layer FFE, resulting in a 1.04× and 1.49× improvement in performance and energy efficiency for these datasets, while still achieving the 1.37× and 1.93× improvement for the remainder. All tasks incur no more than about 1% absolute accuracy loss in this configuration.
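The per-dataset choice of how many FFE layers to use can be reproduced from Table 1. The Python sketch below picks, for each dataset, the deepest tap point available in a 7-layer FFE (0, 4, or 7 fixed layers) whose accuracy loss relative to the baseline stays within a 1% budget. The selection rule, the budget parameter, and the names are illustrative assumptions rather than the authors' procedure.

```python
from typing import Dict

DATASETS = ["CIFAR100", "CIFAR10", "SVHN", "Flwr", "Airc", "GTSR"]

# Accuracy (%) per dataset from Table 1, keyed by the number of fixed layers.
ACCURACY = {
    0: [72.8, 93.5, 95.8, 88.1, 67.7, 97.7],
    4: [72.5, 93.3, 95.7, 88.3, 66.7, 97.8],
    7: [72.0, 92.7, 95.8, 87.5, 64.0, 95.0],
}


def choose_taps(max_loss: float = 1.0) -> Dict[str, int]:
    """Pick the deepest tap point per dataset within the accuracy budget."""
    baseline = ACCURACY[0]
    choice: Dict[str, int] = {}
    for col, name in enumerate(DATASETS):
        best = 0
        for layers in sorted(ACCURACY):
            # Round to the table's precision to avoid floating-point noise.
            loss = round(baseline[col] - ACCURACY[layers][col], 1)
            if loss <= max_loss:
                best = layers
        choice[name] = best
    return choice


taps = choose_taps()
for name, layers in taps.items():
    print(f"{name}: use {layers} fixed layers")
```

With a 1% budget this selects 7 layers for CIFAR100, CIFAR10, SVHN and Flwr, and 4 layers for Airc and GTSR, matching the configuration discussed above. The 1.04× and 1.49× figures for the 4-of-7 case come from the text and cannot be derived from the table alone.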
4 Conclusion
In this paper, we evaluate FixyNN, which proposes to split a CNN model into two components: a common front-end which we compute in a heavily optimized fixed-weight feature extractor (FFE), and a programmable back-end implemented on a conventional programmable hardware accelerator. This combination allows us to take advantage of aggressive hardware specialization for the front-end, but retain generalization to a range of datasets by training the programmable back-end using transfer-learning principles. Experimental results show that FixyNN provides nearly 2× improvement in on-device CNN energy efficiency, with an accuracy degradation of <1%. This is a significant step towards the goal of real-time, on-device CNN inference. While fixing even more layers would result in higher hardware throughput and energy efficiency, it can also degrade prediction accuracy. Therefore, balancing the number of fixed layers is crucial in FixyNN hardware platforms. Finally, we note that future progress in transfer learning research is likely to further strengthen the case for hardware specialization for CNNs.
References
[1] Amr Suleiman, Yu-Hsin Chen, Joel Emer, and Vivienne Sze. Towards Closing the Energy Gap Between HOG and CNN Features for Embedded Vision. In Proc. of ISCAS, 2017.
[2] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.
[3] Arm ML. Arm Machine Learning Processor. URL https://developer.arm.com/products/processors/machine-learning/arm-ml-processor.
[4] NVDLA. Nvidia Deep Learning Accelerator (NVDLA). URL http://nvdla.org/primer.html.
[5] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems 27, 2014.
[6] DeepFreeze. RTL generation tool for CNNs. URL https://github.com/ARM-software/DeepFreeze.
