HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices

HyperAI Team: Yuchen Liu; Kaiyang Han; Zhiqiang Xia; Yuhang Dong; Chen Song; Kangyu Tang; Jiaming Xu; Xiushi Feng; WenXuan Yu; Li Peng; Mingyang Wang; Kai Wang; Changpeng Yang; Yang Li; Haoyu Lu; Hao Wang; Bingna Xu; Guangyao Liu; Long Huang; Kaibin Guo; Jinyang Wu; Dan Wu; Hongzhen Wang; Peng Zhou; Shuai Nie; Shande Wang; Runyu Shi; Ying Huang

arXiv:2512.14052·cs.CV·December 17, 2025

TL;DR

HyperVL is a novel multimodal large language model optimized for edge devices, combining innovative techniques to reduce computational costs while maintaining high performance.

Contribution

It introduces a new image-tiling strategy, a Visual Resolution Compressor, and Dual Consistency Learning to enable efficient on-device multimodal inference.

Findings

01. Achieves state-of-the-art performance among small models
02. Reduces latency and power consumption on mobile devices
03. Supports high-resolution input processing efficiently

Abstract

Current multimodal large language models possess strong perceptual and reasoning capabilities; however, their high computational and memory requirements make them difficult to deploy directly in on-device environments. While small-parameter models are progressively endowed with strong general capabilities, standard Vision Transformer (ViT) encoders remain a critical bottleneck, suffering from excessive latency and memory consumption when processing high-resolution inputs. To address these challenges, we introduce HyperVL, an efficient multimodal large language model tailored for on-device inference. HyperVL adopts an image-tiling strategy to cap peak memory usage and incorporates two novel techniques: (1) a Visual Resolution Compressor (VRC) that adaptively predicts optimal encoding resolutions to eliminate redundant computation, and (2) Dual Consistency Learning (DCL), which aligns multi-scale…
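To make the memory argument behind image tiling concrete, the following is a minimal sketch, not the authors' implementation. It assumes non-overlapping square tiles and a ViT that splits each tile into square patches; the function names, the tile size of 448 pixels, and the patch size of 14 pixels are all hypothetical values chosen for illustration. The point it shows is that when tiles are encoded one at a time, the largest number of patches the encoder sees at once depends on the tile size, not on the input resolution.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def tile_boxes(width: int, height: int, tile: int) -> List[Box]:
    """Split a width x height image into non-overlapping tiles.

    Each tile is at most tile x tile pixels; tiles on the right and
    bottom edges may be smaller. Illustrative helper, not from the paper.
    """
    if width <= 0 or height <= 0 or tile <= 0:
        raise ValueError("width, height and tile must be positive")
    boxes: List[Box] = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            boxes.append(
                (left, top, min(left + tile, width), min(top + tile, height))
            )
    return boxes


def patches_in(box: Box, patch: int) -> int:
    """Number of patch tokens a ViT would produce for one tile (ceil division)."""
    left, top, right, bottom = box
    cols = -(-(right - left) // patch)
    rows = -(-(bottom - top) // patch)
    return cols * rows


def peak_patches(width: int, height: int, tile: int, patch: int) -> int:
    """Largest per-tile token count, i.e. the peak load if tiles are encoded one by one."""
    return max(patches_in(b, patch) for b in tile_boxes(width, height, tile))


if __name__ == "__main__":
    # Assumed sizes: 448-pixel tiles, 14-pixel patches.
    for w, h in [(1000, 600), (4000, 3000)]:
        print(w, h, len(tile_boxes(w, h, 448)), peak_patches(w, h, 448, 14))
```

Under these assumptions the per-tile peak stays at 32 × 32 = 1024 patches for both example resolutions, while the number of tiles grows with the image. How HyperVL actually sizes, overlaps, or batches its tiles is not specified in the abstract.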

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning