ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs

Jinwu Yang; Jiaan Wu; Zedong Liu; Xinyang Ma; Hairui Zhao; Yida Gu; Yuanhong Huang; Xingchen Liu; Wenjing Huang; Zheng Wei; Jing Xing; Yili Ma; Qingyi Zhang; Baoyi An; Zhongzhe Hu; Shaoteng Liu; Xia Zhu; Jiaxun Lu; Guangming Tan; Dingwen Tao

arXiv:2604.03298 · cs.AR · April 8, 2026

TL;DR

ENEC is a lossless compression method optimized for Ascend NPUs that alleviates the model-weight transfer bottleneck, enabling faster inference of large AI models.

Contribution

ENEC introduces an NPU-specific lossless compression technique built on block-based fixed-length encoding with NPU-optimized transformations and bit-packing. It outperforms existing compressors in throughput (vs. DietGPU) and compression ratio (vs. nvCOMP).

Findings

1. ENEC achieves up to 6.3× inference speedup on Ascend NPUs.

2. ENEC outperforms DietGPU by 3.43× in throughput.

3. ENEC provides a better compression ratio than nvCOMP.

Abstract

The rapid scaling of Large Language Models presents significant challenges for their deployment and inference, particularly on resource-constrained specialized AI hardware accelerators such as Huawei's Ascend NPUs, where weight data transfer has become a critical performance bottleneck. While lossless compression can preserve model accuracy and reduce data volume, existing lossless compression algorithms exhibit extremely low throughput when ported to the Ascend NPU architecture. In this paper, we propose ENEC, a novel lossless compression method specifically customized for AI model weights and optimized for Ascend Neural Processing Units. ENEC adopts a block-based fixed-length encoding scheme and incorporates a series of NPU-specific optimizations: bit-width quantization with hierarchical halving bit-packing, vectorized branch-free integer transformation, and dependency-decoupled…
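The abstract names a block-based fixed-length encoding scheme, but the summary is truncated before the details. As a rough illustration of the general idea only, the sketch below uses frame-of-reference encoding, a standard technique swapped in here: each block stores its minimum and the smallest bit width that covers every delta, then packs all deltas at that single fixed width. Because every value in a block sits at a computable offset, decoding needs no sequential scan, unlike variable-length codes such as Huffman. The block size, the 16-bit per-block header, and all function names are hypothetical assumptions; this is not ENEC's algorithm, and it omits the bit-width quantization, hierarchical halving bit-packing, and NPU vectorization described in the abstract.

```python
from typing import List, Tuple

BLOCK_SIZE = 32  # hypothetical block length, not taken from the paper

Block = Tuple[int, int, int, int]  # (count, base, bit_width, packed)


def encode_block(values: List[int]) -> Tuple[int, int, int]:
    """Frame-of-reference + fixed-length packing of one non-empty block."""
    base = min(values)
    deltas = [v - base for v in values]          # all deltas are >= 0
    bit_width = max(deltas).bit_length()         # 0 when all values are equal
    packed = 0
    for i, d in enumerate(deltas):
        packed |= d << (i * bit_width)           # field i starts at bit i*width
    return base, bit_width, packed


def decode_block(base: int, bit_width: int, packed: int, n: int) -> List[int]:
    """Recover n values; each field is read independently by its offset."""
    mask = (1 << bit_width) - 1
    return [base + ((packed >> (i * bit_width)) & mask) for i in range(n)]


def compress(data: List[int], block_size: int = BLOCK_SIZE) -> List[Block]:
    blocks: List[Block] = []
    for start in range(0, len(data), block_size):
        chunk = data[start:start + block_size]
        blocks.append((len(chunk),) + encode_block(chunk))
    return blocks


def decompress(blocks: List[Block]) -> List[int]:
    out: List[int] = []
    for n, base, bit_width, packed in blocks:
        out.extend(decode_block(base, bit_width, packed, n))
    return out


def packed_bits(blocks: List[Block], header_bits: int = 16) -> int:
    """Payload size, assuming 8 bits for base + 8 bits for width per block."""
    return sum(header_bits + n * bit_width for n, _, bit_width, _ in blocks)


if __name__ == "__main__":
    # Byte values chosen to mimic a narrow value range.
    data = [120, 121, 119, 122, 120, 118, 121, 123] * 8
    blocks = compress(data)
    assert decompress(blocks) == data            # lossless round trip
    print(packed_bits(blocks), "bits vs", 8 * len(data), "raw")
```

On the 64 sample bytes, both 32-value blocks need only 3 bits per value (deltas span 0 to 5), so the payload is 224 bits including the assumed headers, versus 512 bits raw. A fixed width per block trades some compression ratio against entropy coders for trivially parallel decoding, which is the kind of trade-off the abstract attributes to ENEC's design.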

Code & Models

Repositories

hpdps-group/ENEC (GitHub)