POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs

Haicheng Wang; Yuan Liu; Yikun Liu; Zhemeng Yu; Zhongyin Zhao; Yangxiu You; Zilin Yu; Le Tian; Xiao Zhou; Jie Zhou; Weidi Xie; Yanfeng Wang

arXiv:2604.11627·cs.CV·April 14, 2026

TL;DR

POINTS-Long introduces an adaptive dual-mode multimodal model that dynamically balances efficiency and accuracy in visual reasoning, especially for long videos and streaming scenarios.

Contribution

It presents a novel dual-mode architecture with dynamic token scaling and streaming visual understanding capabilities for scalable MLLMs.

Findings

01

Standby mode retains 97.7-99.7% of the original accuracy while using only 1/40 to 1/10 of the visual tokens.

02

Focus mode retains the best performance on fine-grained visual tasks.

03

Supports efficient long-form visual understanding via detachable KV-cache.
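
To make the token ratio in the first finding concrete, here is a minimal arithmetic sketch. Only the 1/40 and 1/10 ratios come from the source; the function name `standby_token_range`, the frame count, and the tokens-per-frame value are hypothetical and chosen purely for illustration. It does not reflect how POINTS-Long actually selects or compresses tokens.

```python
from fractions import Fraction

# Ratios reported in the source: standby mode uses 1/40 to 1/10 of the visual tokens.
STANDBY_RATIOS = (Fraction(1, 40), Fraction(1, 10))


def standby_token_range(num_frames: int, tokens_per_frame: int) -> tuple[int, int]:
    """Return the (low, high) visual-token count implied by the reported ratios.

    The full budget is num_frames * tokens_per_frame; both inputs are
    illustrative assumptions, not values from the paper.
    """
    full_budget = num_frames * tokens_per_frame
    low, high = (int(full_budget * ratio) for ratio in STANDBY_RATIOS)
    return low, high


# Hypothetical workload: 1,000 frames at 256 visual tokens per frame.
full_budget = 1000 * 256
low, high = standby_token_range(1000, 256)
print(full_budget, low, high)  # prints: 256000 6400 25600
```

Under these assumed numbers, a 256,000-token visual sequence would shrink to somewhere between 6,400 and 25,600 tokens in standby mode, which is the scale of saving the finding describes.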

Abstract

Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable capabilities in cross-modal understanding and generation. However, the rapid growth of visual token sequences, especially in long-video and streaming scenarios, poses a major challenge to their scalability and real-world deployment. To address this, we introduce POINTS-Long, a native dual-mode MLLM featuring dynamic visual token scaling inspired by the human visual system. The model supports two complementary perception modes, focus mode and standby mode, enabling users to dynamically trade off efficiency and accuracy during inference. On fine-grained visual tasks, the focus mode retains the optimal performance, while on long-form general visual understanding, the standby mode retains 97.7-99.7% of the original accuracy using only 1/40 to 1/10 of the visual tokens. Moreover, POINTS-Long natively supports streaming visual…
