AutoNeural: Co-Designing Vision-Language Models for NPU Inference

Wei Chen; Liangmin Wu; Yunhai Hu; Zhiyuan Li; Zhiyuan Cheng; Yicheng Qian; Lingyue Zhu; Zhipeng Hu; Luoyi Liang; Qiang Tang; Zhen Liu; and Han Yang

arXiv:2512.02924 · cs.CL · December 9, 2025

Open Access

TL;DR

AutoNeural is an NPU-native vision-language model architecture, co-designed for integer-only inference on edge hardware, that reduces vision encoder quantization error by up to 7x and end-to-end latency by 14x.

Contribution

The paper presents an NPU-native VLM architecture that pairs a MobileNetV5-style vision backbone with a hybrid SSM-Transformer language model, designed for integer-only inference and real-time edge deployment.

Findings

01. Reduces vision encoder quantization error by up to 7x

02. Achieves 14x lower end-to-end latency

03. Enables real-time automotive cockpit AI on Qualcomm SoC

Abstract

While Neural Processing Units (NPUs) offer high theoretical efficiency for edge AI, state-of-the-art Vision-Language Models (VLMs) tailored for GPUs often falter on these substrates. We attribute this hardware-model mismatch to two primary factors: the quantization brittleness of Vision Transformers (ViTs) and the I/O-bound nature of autoregressive attention mechanisms, which fail to utilize the high arithmetic throughput of NPUs. To bridge this gap, we propose AutoNeural, an NPU-native VLM architecture co-designed for integer-only inference. We replace the standard ViT encoder with a MobileNetV5-style backbone utilizing depthwise separable convolutions, which ensures bounded activation distributions for stable INT4/8/16 quantization. Complementing this, our language backbone integrates State-Space Model (SSM) principles with Transformer layers, employing efficient gated convolutions…
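The link the abstract draws between bounded activation distributions and stable low-bit quantization can be illustrated with a small, self-contained sketch. It is not the paper's quantization pipeline: the symmetric per-tensor max-abs scheme, the synthetic activations, the outlier magnitude of 60.0, and the helper names `quantize_dequantize` and `mean_squared_error` are all assumptions chosen for illustration. The point is only that a few large outliers stretch the quantization scale, so the bulk of small values loses precision.

```python
import random


def quantize_dequantize(values, bits=8):
    """Symmetric per-tensor fake quantization (an assumed scheme, not the paper's).

    Values are scaled by max-abs, rounded to signed integers in
    [-qmax, qmax], and mapped back to floats.
    """
    qmax = 2 ** (bits - 1) - 1  # 7 for INT4, 127 for INT8, 32767 for INT16
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    restored = []
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))  # integer code
        restored.append(q * scale)
    return restored


def mean_squared_error(a, b):
    """Mean squared difference between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


rng = random.Random(0)

# Synthetic "bounded" activations: every value lies in [-1, 1].
bounded = [rng.uniform(-1.0, 1.0) for _ in range(4096)]

# Same activations with a handful of large outliers injected
# (hypothetical magnitude, chosen only to make the effect visible).
with_outliers = list(bounded)
for i in range(0, len(with_outliers), 512):
    with_outliers[i] = 60.0

errors = {}
for bits in (4, 8, 16):
    errors[bits] = {
        "bounded": mean_squared_error(bounded, quantize_dequantize(bounded, bits)),
        "outlier": mean_squared_error(
            with_outliers, quantize_dequantize(with_outliers, bits)
        ),
    }
    print(
        f"INT{bits}: bounded MSE={errors[bits]['bounded']:.3e}, "
        f"outlier MSE={errors[bits]['outlier']:.3e}"
    )
```

In this toy setting the outlier-laden tensor shows a larger reconstruction error at every bit width, because the scale is set by the largest value. Real deployments often use per-channel scales or clipping instead of plain max-abs calibration, so the size of the gap here should not be read as a measurement of any actual model.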

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications