Binary Weight Multi-Bit Activation Quantization for Compute-in-Memory CNN Accelerators
Wenyong Zhou, Zhengwu Liu, Yuan Ren, Ngai Wong

TL;DR
This paper proposes a novel binary weight multi-bit activation quantization method for CNNs on compute-in-memory accelerators, improving accuracy while maintaining hardware efficiency.
Contribution
It introduces closed-form weight quantization solutions and a differentiable activation quantization function, enhancing binarized weights and multi-bit activations for CIM platforms.
Findings
Achieves 1.44%-5.46% accuracy gain on CIFAR-10 and ImageNet
4-bit activation quantization offers optimal hardware-performance balance
Significantly improves binarized weight representational capacity
| Models | Size | Clean | Baseline | Ours |
|---|---|---|---|---|
| Mamba | 3 bits | 86.21% | 1.00 / 1.00 | 1.00 / 1.00 |
| | 4 bits | 87.06% | 1.16 / 1.19 | 1.14 / 1.19 |
| | 5 bits | 87.15% | 1.32 / 1.39 | 1.28 / 1.37 |
| | 6 bits | 87.37% | 1.48 / 1.59 | 1.42 / 1.57 |
| Mamba2 | 3 bits | 86.53% | 1.00 / 1.00 | 1.00 / 1.00 |
| | 4 bits | 87.35% | 1.16 / 1.20 | 1.14 / 1.19 |
| | 5 bits | 87.59% | 1.32 / 1.39 | 1.28 / 1.38 |
| | 6 bits | 87.68% | 1.48 / 1.59 | 1.43 / 1.57 |
Wenyong Zhou, Zhengwu Liu*, Yuan Ren, and Ngai Wong
This research was partially conducted by ACCESS – AI Chip Center for Emerging Smart Systems, supported by the InnoHK initiative of the Innovation and Technology Commission of the Hong Kong Special Administrative Region Government, and partially supported by the Theme-based Research Scheme (TRS) project T45-701/22-R of the Research Grants Council (RGC), Hong Kong SAR, and the National Natural Science Foundation of China Project 62404187.
All authors are with the Department of Electrical and Electronic Engineering, The University of Hong Kong.
*Corresponding authors: Zhengwu Liu, Ngai Wong.
Abstract
Compute-in-memory (CIM) accelerators have emerged as a promising way for enhancing the energy efficiency of convolutional neural networks (CNNs). Deploying CNNs on CIM platforms generally requires quantization of network weights and activations to meet hardware constraints. However, existing approaches either prioritize hardware efficiency with binary weight and activation quantization at the cost of accuracy, or utilize multi-bit weights and activations for greater accuracy but limited efficiency. In this paper, we introduce a novel binary weight multi-bit activation (BWMA) method for CNNs on CIM-based accelerators. Our contributions include: deriving closed-form solutions for weight quantization in each layer, significantly improving the representational capabilities of binarized weights; and developing a differentiable function for activation quantization, approximating the ideal multi-bit function while bypassing the extensive search for optimal settings. Through comprehensive experiments on CIFAR-10 and ImageNet datasets, we show that BWMA achieves notable accuracy improvements over existing methods, registering gains of 1.44%-5.46% and 0.35%-5.37% on respective datasets. Moreover, hardware simulation results indicate that 4-bit activation quantization strikes the optimal balance between hardware cost and model performance.
Index Terms:
Compute-in-Memory, Model Quantization, SRAM, RRAM, FeFET
I Introduction
Convolutional Neural Networks (CNNs) are pivotal in computer vision tasks, yet their growing complexity necessitates more computational power and energy [1, 2]. Conventional digital circuits for CNN inference, such as central processing units (CPUs) and graphics processing units (GPUs), struggle with data movement inefficiencies due to their von Neumann architecture with separate memory and processing units [3, 4]. The compute-in-memory (CIM) architecture, which integrates processing and memory, offers a solution by largely reducing data movement [5, 6]. Utilizing static random-access memory (SRAM), resistive random-access memory (RRAM), or ferroelectric field-effect transistor (FeFET), CIM-based accelerators enhance energy efficiency during CNN’s most demanding task, namely, matrix-vector multiplications (MVMs), by executing in the analog domain within crossbar arrays and leveraging analog-to-digital converters (ADCs) and digital-to-analog converters (DACs) for data conversion [7].
While CIM architectures offer promising solutions for CNN acceleration, they face practical implementation challenges due to hardware constraints and analog computing non-idealities. Model quantization, which reduces the precision of weights and activations, emerges as a crucial technique to address these challenges [8, 9]. However, quantization for CIM architectures requires special consideration. In CIM designs, weight precision directly impacts the area and power consumption of memory cells, while activation precision determines ADC/DAC complexity and energy cost [3].
CIM-oriented quantization approaches have been developed utilizing two main strategies [10, 11, 12, 13]. The first strategy, such as CIM-BNN [14] and XNOR-RRAM [13], focuses on extremely quantized networks with binary weights and binary activations (BWBA, see Fig. 1(a)) for efficient hardware implementations by replacing multiply-and-accumulate operations with XNOR and bit counting operations [13]. However, the strategy results in significant accuracy loss due to the radical quantization. The second strategy, such as CIMQ [15], employs multi-bit weights and multi-bit activations (MWMA, see Fig. 1(b)) to maintain higher accuracy. Nevertheless, MWMA approaches face practical limitations: multi-bit weights increase storage overhead in CIM devices, while higher activation bitwidth impacts ADC costs and peripheral circuitry. Consequently, achieving both high accuracy and hardware efficiency remains challenging in CIM-oriented quantized networks.
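The replacement of multiply-and-accumulate by XNOR and bit counting in BWBA networks can be illustrated with a minimal sketch. The packing convention (bit 1 for +1, bit 0 for -1) and the helper names `encode` and `xnor_popcount_dot` are assumptions made for this example and are not taken from the cited works.

```python
def encode(values):
    """Pack a list of +1/-1 values into an int (assumed convention: bit 1 means +1)."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def xnor_popcount_dot(w_word, a_word, n):
    """Dot product of two length-n {-1, +1} vectors from their packed encodings."""
    mask = (1 << n) - 1
    agree = ~(w_word ^ a_word) & mask  # XNOR: 1 wherever the signs agree
    matches = bin(agree).count("1")    # bit counting (popcount)
    return 2 * matches - n             # matches minus mismatches


weights = [1, -1, -1, 1, 1, -1, 1, 1]
acts = [1, 1, -1, -1, 1, -1, -1, 1]

reference = sum(w * a for w, a in zip(weights, acts))
fast = xnor_popcount_dot(encode(weights), encode(acts), len(weights))
```

Both `reference` and `fast` evaluate to the same integer, which is why binary networks can drop multipliers entirely; the cost is that each operand carries only one bit of information.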
To address these challenges, we propose a novel binary weight multi-bit activation (BWMA) quantization approach for CIM-based accelerators, as illustrated in Fig. 1(c). The contributions of this paper are multifaceted:
- Unlike prior hardware-agnostic approaches, we propose a quantization framework that considers CIM’s mixed-signal constraints by optimizing bitwidth based on cell precision and data converter resolution.
- We enhance model representation through two key innovations: a closed-form layer-specific weight binarization method and an efficient differentiable function for uniform multi-bit quantization, eliminating the need for exhaustive parameter search.
- Extensive evaluations validate our approach: achieving 0.35%-5.46% accuracy improvements on CIFAR-10 and ImageNet datasets, while hardware simulations across various CNN architectures reveal 4-bit data converters as the optimal trade-off between hardware cost and model performance.
II Related Work
Model quantization has emerged as a crucial technique for deep neural network deployment, with research spanning from binary to multi-bit precision schemes. Early binary neural networks demonstrated promising hardware efficiency, with FINN [16] providing a scalable framework for BNN inference. To address the accuracy challenges of binary networks, Tang et al. [9] introduced specialized training methods for compact binary networks, while ReactNet [8] developed generalized activation functions to enhance BNN expressivity. Martinez et al. [17] bridged full-precision and binary network training through real-to-binary convolutions, and balanced binary neural networks [18] improved information flow using gated residual mechanisms. Beyond binary quantization, PACT [19] proposed parameterized clipping activations for flexible multi-bit quantization, enabling learnable quantization ranges that better preserve network accuracy. However, these quantization approaches primarily target conventional digital hardware, leaving room for specialized solutions for CIM architectures with unique characteristics and constraints.
There are many CIM-based CNN implementations. For example, ISAAC [3] explores an in-situ processing approach, where memristor crossbar arrays not only store input weights but also perform dot-product operations in the analog domain. IMCE [4] employs parallel computational memory sub-arrays as fundamental units for bit-wise in-memory convolution operations. While both ISAAC and IMCE focus on hardware architecture design, our work emphasizes algorithm-level optimization and provides comprehensive evaluation to bridge the gap between network quantization and practical CIM implementation.
III Methodology
III-A Quantization on CIM Accelerators
Due to design complexity and reliability concerns in various CIM devices (e.g., SRAM, RRAM, FeFET), we assume each cell in CIM crossbar arrays stores a 1-bit value, though different technologies demonstrate varied bit-width capabilities. While traditional BNNs [8, 9] constrain weights to fixed binary values (±1), limiting model representation across diverse CNN layers, CIM-based accelerators enable adaptive binary sets through layer-specific scaling factors. The weight-to-conductance mapping in CIM can be expressed as:
[TABLE]
where the mapping is bounded by the minimum and maximum cell conductances, and two scaling factors align the weight and conductance ranges. This mapping enables expanded weight representation through layer-specific binary values while maintaining hardware simplicity. The mixed-signal nature of CIM architectures necessitates multi-bit ADCs to convert analog matrix-vector multiplication outputs for digital processing. This hardware characteristic motivates our BWMA strategy: employing binary weights for efficient crossbar operations while maintaining higher precision in activations to match ADC resolution. Our BWMA framework jointly optimizes both components: it determines layer-specific binary weights by preserving statistical distributions (mean and standard deviation), while aligning activation quantization with ADC characteristics through a piecewise differentiable approximation.
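One plausible reading of the weight-to-conductance mapping is a linear map with a scale and an offset, sketched below. The linear form, the function name `weight_to_conductance`, and the numeric conductance bounds are all assumptions chosen for illustration; they are not values or formulas confirmed by the paper.

```python
import math


def weight_to_conductance(w, w_low, w_high, g_min, g_max):
    """Linearly map a weight in [w_low, w_high] to a conductance in [g_min, g_max].

    Assumed form: g = scale * w + offset, with the two factors chosen so that
    the weight range lines up with the conductance range.
    """
    scale = (g_max - g_min) / (w_high - w_low)
    offset = g_min - scale * w_low
    return scale * w + offset


# Hypothetical layer with two binary weight values and hypothetical
# conductance bounds in siemens.
w_low, w_high = -0.3, 0.5
g_min, g_max = 1e-6, 1e-4

g_low = weight_to_conductance(w_low, w_low, w_high, g_min, g_max)
g_high = weight_to_conductance(w_high, w_low, w_high, g_min, g_max)
```

Under this assumption each layer may use its own pair of weight values, yet every cell still stores only one of two conductance states.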
III-B BWMA Quantization
Our BWMA framework employs quantization-aware training (QAT) to achieve optimal performance. During training, we maintain both full-precision and quantized representations of weights and activations. The binary weight values are determined through moment matching, while activations are quantized to multiple bits using our proposed differentiable approximation function.
For the model weights $w$ in each layer, weight binarization maps them to two distinct values $b_1$ and $b_2$, as shown in Fig. 2(a). Minimizing Kullback–Leibler (KL) divergence for binarization is intuitive but can be suboptimal due to calculation difficulties with uncertain priors [20]. Here, we propose aligning the first and second moments of distributions as an alternative to KL divergence for weight binarization. Considering that weights in CNNs generally follow a symmetric distribution [20], we partition the weights at the median so that they are evenly separated, as shown in Fig. 2(a). By mapping the left half to $b_1$ and the right half to $b_2$, our method is formulated as:
$$\mathbb{E}[w_b] = \tfrac{1}{2}\,(b_1 + b_2) = \mathbb{E}[w] \tag{2}$$
$$\mathbb{E}[w_b^2] = \tfrac{1}{2}\,(b_1^2 + b_2^2) = \mathbb{E}[w^2] \tag{3}$$
where $w_b$ denotes the binarized weights and $\mathbb{E}[\cdot]$ calculates the expected value (i.e., mean) of its argument. For ease of computation, we rewrite the binary values as $b_1 = c - d$ and $b_2 = c + d$, where $c$ is the midpoint between $b_1$ and $b_2$, and $d$ is a positive number representing the distance between $b_1$ ($b_2$) and $c$. Substituting into Eqs. 2 and 3 gives $c = \mathbb{E}[w]$ and $c^2 + d^2 = \mathbb{E}[w^2]$, so that:
$$c = \mathbb{E}[w], \qquad d = \sqrt{\mathbb{E}[w^2] - \mathbb{E}[w]^2} \tag{4}$$
The terms on the right-hand side of Eq. 4 are the mean ($\mu$) and standard deviation ($\sigma$) of original full-precision weights, respectively.
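The moment-matching step can be sketched in a few lines. The sketch assumes that, with half of the weights assigned to each value, the two binary values work out to the mean minus and plus the standard deviation; the function name `binarize_moment_matching` and the synthetic Gaussian layer are hypothetical and only serve the illustration.

```python
import random
import statistics


def binarize_moment_matching(weights):
    """Map weights to two values so that mean and variance are preserved.

    Assumes an even number of distinct weights so that the median splits
    them into two equal halves.
    """
    mu = statistics.fmean(weights)
    sigma = statistics.pstdev(weights)   # population standard deviation
    median = statistics.median(weights)
    low, high = mu - sigma, mu + sigma   # assumed closed-form binary values
    binarized = [high if w > median else low for w in weights]
    return binarized, low, high


rng = random.Random(0)
layer = [rng.gauss(0.05, 0.4) for _ in range(1024)]  # synthetic weights

binarized, low, high = binarize_moment_matching(layer)

# Gaps between the first two moments before and after binarization.
mean_gap = abs(statistics.fmean(binarized) - statistics.fmean(layer))
std_gap = abs(statistics.pstdev(binarized) - statistics.pstdev(layer))
```

Because the split is even, both gaps are zero up to floating-point rounding. With an odd number of weights or tied values at the median the match would only be approximate.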
Our BWMA framework achieves distribution alignment through moment matching. Specifically, for each layer during the forward pass, we compute the mean $\mu$ and standard deviation $\sigma$ of the original weights. These statistics determine the two binary values, $\mu - \sigma$ and $\mu + \sigma$. To handle the non-differentiable binarization operation in the backward pass, we propose a modified straight-through estimator (STE):
[TABLE]
The estimator involves a temperature parameter that controls the steepness of the gradient approximation, and a scaling factor.
Unlike the closed-form solution available for weight binarization, finding an analytical solution for activation quantization is infeasible due to the complex and asymmetrical distributions of activation values. This makes moment alignment ineffective in multi-bit scenarios. Consider the multi-bit uniform quantization defined as follows:
[TABLE]
where the quantization range is bounded by the minimum and maximum activation values, and the step size is the length of the uniform intervals. The rounding function returns the nearest integer. Because rounding is non-differentiable, it complicates the update of activations during backpropagation.
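A minimal sketch of multi-bit uniform quantization follows, assuming the common min-max form in which the range is divided into 2^k - 1 equal intervals for k bits. The function name `uniform_quantize` and the sample activations are hypothetical, and the paper's exact formulation may differ (for example, in how the range is clipped).

```python
def uniform_quantize(x, bits):
    """Quantize a list of floats onto 2**bits evenly spaced levels.

    Assumed form: q = x_min + round((x - x_min) / step) * step,
    with step = (x_max - x_min) / (2**bits - 1).
    """
    x_min, x_max = min(x), max(x)
    step = (x_max - x_min) / (2 ** bits - 1)
    if step == 0.0:                      # constant input: nothing to quantize
        return list(x), step
    quantized = [x_min + round((v - x_min) / step) * step for v in x]
    return quantized, step


acts = [0.0, 0.07, 0.4, 0.93, 1.5, 2.2, 3.0]  # hypothetical activations
quantized, step = uniform_quantize(acts, bits=4)

# Largest deviation introduced by quantization; bounded by half a step.
max_error = max(abs(q - v) for q, v in zip(quantized, acts))
```

The rounding call is the piece whose gradient is zero almost everywhere, which is what motivates a differentiable surrogate during training.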
An empirical solution to this problem involves using an STE, which comprises differentiable functions [21, 22]. The quantization function, consisting of scaled and shifted step functions as illustrated in Fig. 2(b), is approximated by a novel differentiable function aimed at simulating the step function. Historically, the derivative of the step function, i.e., the Dirac delta function, has been estimated using either a rectangular [21] or triangular [22] function (both computationally efficient but imprecise), or a piecewise function [20], which offers greater accuracy but is computationally burdensome. To overcome these limitations, we introduce a quadratic function as a differentiable approximation of the Dirac function:
[TABLE]
The quadratic form of this approximation, akin to a bell-shaped curve, provides a superior approximation compared to constant or linear approaches, while also being computationally more efficient than complex piecewise functions. Integrating it yields a function that approximates the step function:
[TABLE]
By scaling and shifting this integrated function, with one parameter for the scale and one for the center of each interval, we approximate the multi-bit quantization function effectively.
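The idea of building a differentiable quantizer from scaled and shifted smooth steps can be sketched as follows. The particular quadratic used here, 0.75 * (1 - x^2) on [-1, 1], is an assumption picked because it has unit area; it is not claimed to be the paper's function. The names `bump`, `smooth_step`, `soft_quantize`, and the `width` parameter are likewise hypothetical.

```python
def bump(x):
    """Assumed quadratic stand-in for the Dirac delta: unit area on [-1, 1]."""
    return 0.75 * (1.0 - x * x) if abs(x) <= 1.0 else 0.0


def smooth_step(x):
    """Integral of bump from -infinity to x; approximates the unit step."""
    if x <= -1.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    return (2.0 + 3.0 * x - x ** 3) / 4.0


def soft_quantize(x, x_min, x_max, bits, width):
    """Sum of scaled, shifted smooth steps approximating a uniform quantizer.

    `width` is the half-width of each transition region; keeping it below
    half a step prevents neighbouring transitions from overlapping.
    """
    step = (x_max - x_min) / (2 ** bits - 1)
    out = x_min
    for i in range(2 ** bits - 1):
        center = x_min + (i + 0.5) * step  # boundary between levels i and i+1
        out += step * smooth_step((x - center) / width)
    return out


# 2-bit example on [0, 3]: levels 0, 1, 2, 3 with boundaries at 0.5, 1.5, 2.5.
flat_region = soft_quantize(0.9, 0.0, 3.0, bits=2, width=0.1)
on_boundary = soft_quantize(0.5, 0.0, 3.0, bits=2, width=0.1)
```

Away from the boundaries the output coincides with hard rounding, while inside each transition region the gradient follows the quadratic bump instead of vanishing. Shrinking `width` sharpens the staircase at the cost of larger gradients near the boundaries.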
IV Experiments
IV-A Experiment Setup
All experiments are conducted on NVIDIA RTX 3090 GPUs with 24 GB VRAM. The hardware simulations rely entirely on the DNN+NeuroSim platform [7], which supports VGG and ResNet models across SRAM, RRAM, and FeFET devices. As shown in Fig. 3, DNN+NeuroSim leverages an H-tree routing architecture to efficiently manage data movement across different hierarchical levels. The routing infrastructure spans from chip-level interconnects down to individual PE arrays, facilitating communication between computational tiles, global buffers, and functional units. Moreover, the framework incorporates an optimized spatial mapping scheme for convolutional layers that partitions kernels based on their spatial locations into sub-matrices, reducing buffer access requirements and improving data reuse efficiency. For linear layers, the framework employs a conventional mapping approach by unrolling weights into column vectors for matrix multiplication operations.
IV-B Experiment Results
Table I compares the accuracy of ResNet-18 trained with our method against existing quantization schemes on the CIFAR-10 and ImageNet datasets. The CIM-oriented MWMA methods quantize different parts of models, including weights [11], activations [12], or both. Because these methods use mixed precision, the average bitwidth of their weights and activations can be a non-integer. The CIM-oriented BWBA method [13] emphasizes hardware implementation on RRAM-based CIM accelerators. Compared to previous CIM-oriented quantization methods, our BWMA approach with binary weights and 4-bit activations, an optimal balance between hardware cost and accuracy (explained later), improves the accuracy by 1.44%-5.46% and 0.35%-5.37% on the CIFAR-10 and ImageNet datasets, respectively.
Table II compares our method with the vanilla counterparts of BDenseNet [23], MeliusNet [24], and ReActNet [8]. Since these models are binary-specific structures, we set the activation to 1-bit. For BDenseNet28, our method achieves a 0.9% improvement under the same training settings with negligible computational overhead. Similarly, when applied to MeliusNet and ReActNet-A structures, our method brings consistent performance gains.
For hardware metrics, we simulate the latency, chip area, and energy consumption of VGG-8 and ResNet-20 on the CIFAR-10 dataset. The results are reported in Figs. 4, 5, and 6. Fig. 4 depicts the total latency and its breakdown by component for VGG-8 and ResNet-20 at different crossbar sizes, taking RRAM-based accelerators as an example. According to the simulation results, the ADCs contribute roughly 14% of the total latency, while the accumulation circuits (at PE and tile levels) and other peripheral circuits are the primary contributors (accounting for 23% and 63%, respectively). Although the absolute values vary significantly across CNNs, the relative shares of the components are stable, indicating that CIM-based accelerators behave similarly across CNNs. Increasing the crossbar size lowers latency, which coincides with the trend in Fig. 6, demonstrating that larger crossbars could improve hardware efficiency.
Fig. 5 compares the impact of crossbar size and model architecture on the chip area of RRAM-based accelerators. The CIM array accounts for only a small share of the chip area, ranging from 1% to 15%. On the one hand, the cells in the crossbar arrays of SRAM-based accelerators are much larger than those in RRAM/FeFET-based accelerators: an SRAM cell comprises 6 transistors, whereas a typical RRAM cell consists of one transistor and one RRAM device (1T1R). On the other hand, other peripheral circuits take a smaller portion in SRAM-based accelerators than in RRAM/FeFET-based accelerators, so the total chip areas differ only slightly across device types. Given that CIM arrays also contribute negligibly to latency and energy consumption, optimizing the data conversion process and peripheral circuits plays a vital role in reducing the hardware cost of CIM-based accelerators.
Fig. 6 compares the impact of crossbar size and device type on energy consumption in various quantization scenarios. From the figure, SRAM-based accelerators consume 57% to 73% of energy than emerging devices due to the lower working voltage. The higher working voltage of FeFET-based accelerators leads to more energy consumption than RRAM-based accelerators. Although larger crossbars are typically more hardware efficient, the expected descending trend is observed in neither VGG-8 nor ResNet-20, mainly because the share of unused cells increases (from 6.5% to 30.4% and from 10.7% to 39.1% for VGG-8 and ResNet-20, respectively) when opting for larger crossbars. Therefore, maintaining high resource utilization should be a key concern in determining a suitable crossbar size.
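Why larger crossbars can leave more cells unused is easy to see with a back-of-the-envelope calculation. The sketch below assumes a layer's weights are unrolled into a rows-by-columns matrix and tiled onto square crossbars with one weight per cell; the function name `crossbar_utilization` and the layer shape are hypothetical, and the simplification ignores the simulator's actual mapping scheme, so it does not reproduce the percentages reported above.

```python
import math


def crossbar_utilization(rows, cols, size):
    """Fraction of crossbar cells that hold a weight.

    Assumes a rows x cols weight matrix tiled onto size x size crossbars,
    one weight per cell, with partially filled crossbars counted in full.
    """
    tiles = math.ceil(rows / size) * math.ceil(cols / size)
    return (rows * cols) / (tiles * size * size)


# Hypothetical 3x3 convolution with 64 input and 128 output channels,
# unrolled into a (3 * 3 * 64) x 128 = 576 x 128 weight matrix.
rows, cols = 3 * 3 * 64, 128

util_64 = crossbar_utilization(rows, cols, 64)
util_128 = crossbar_utilization(rows, cols, 128)
util_256 = crossbar_utilization(rows, cols, 256)
```

For this layer the utilization falls as the crossbar grows, because the matrix dimensions stop dividing evenly into the array and eventually become smaller than a single array.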
Our multi-bit activation quantization enables flexible accuracy-hardware trade-offs. To quantify these trade-offs, we analyze hardware costs across various data converter resolutions (summarized in the average hardware cost table), normalizing latency, area, and energy metrics against 3-bit converters. Although SRAM- and RRAM-based implementations exhibit distinct absolute performance characteristics, their normalized costs follow similar trends with increasing resolution. Higher-resolution converters (5-/6-bit) introduce substantial hardware overhead without commensurate accuracy gains, while 4-bit precision consistently emerges as the optimal balance point across all device types, architectures, and crossbar sizes. This finding demonstrates that our BWMA framework not only enhances model accuracy but also provides clear guidelines for hardware-efficient CIM accelerator design.
V Conclusion
This paper has proposed a hardware-aware quantization framework for CIM-based accelerators that optimizes both cell precision and data converter resolution. Our key innovations include analytically-derived layer-specific weight binarization through moment matching, and an efficient differentiable approximation for uniform multi-bit quantization. Experiments demonstrate superior accuracy with 1.44%-5.46% and 0.35%-5.37% improvements on CIFAR-10 and ImageNet respectively, while hardware simulations across SRAM, RRAM, and FeFET implementations identify 4-bit data converters as the optimal balance between cost and performance.
References
- [1] K. He et al., “Deep Residual Learning for Image Recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
- [2] K. Simonyan et al., “Very Deep Convolutional Networks for Large-Scale Image Recognition,” in Proceedings of the International Conference on Learning Representations, 2015.
- [3] A. Shafiee et al., “ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars,” in Proceedings of the 43rd International Symposium on Computer Architecture, 2016.
- [4] S. Angizi et al., “IMCE: Energy-Efficient Bit-Wise In-Memory Convolution Engine for Deep Neural Network,” in 2018 23rd Asia and South Pacific Design Automation Conference (ASP-DAC), 2018.
- [5] W. Zhou et al., “Towards Robust RRAM-Based Vision Transformer Models with Noise-Aware Knowledge Distillation,” in 2025 Design, Automation & Test in Europe Conference (DATE), 2025.
- [6] W. Zhou et al., “RRAM-Based Isotropic CNNs with High Robustness and Resource Utilization Rate,” in 2025 9th IEEE Electron Devices Technology & Manufacturing Conference (EDTM), 2025.
- [7] X. Peng et al., “DNN+NeuroSim V2.0: An End-to-End Benchmarking Framework for Compute-in-Memory Accelerators for On-Chip Training,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2021.
- [8] Z. Liu et al., “ReActNet: Towards Precise Binary Neural Network with Generalized Activation Functions,” in Proceedings of the European Conference on Computer Vision, Springer, 2020.
