POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs
Haicheng Wang, Yuan Liu, Yikun Liu, Zhemeng Yu, Zhongyin Zhao, Yangxiu You, Zilin Yu, Le Tian, Xiao Zhou, Jie Zhou, Weidi Xie, Yanfeng Wang

TL;DR
POINTS-Long introduces an adaptive dual-mode multimodal model that dynamically balances efficiency and accuracy in visual reasoning, especially for long videos and streaming scenarios.
Contribution
It presents a novel dual-mode architecture with dynamic token scaling and streaming visual understanding capabilities for scalable MLLMs.
Findings
Standby mode retains 97.7-99.7% of the original accuracy using only 1/40 to 1/10 of the visual tokens.
Focus mode achieves optimal performance on fine-grained visual tasks.
Supports efficient long-form visual understanding via a detachable KV-cache.
Abstract
Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable capabilities in cross-modal understanding and generation. However, the rapid growth of visual token sequences, especially in long-video and streaming scenarios, poses a major challenge to their scalability and real-world deployment. To address this, we introduce POINTS-Long, a native dual-mode MLLM featuring dynamic visual token scaling inspired by the human visual system. The model supports two complementary perception modes, focus mode and standby mode, enabling users to dynamically trade off efficiency and accuracy during inference. On fine-grained visual tasks, the focus mode retains optimal performance, while on long-form general visual understanding, the standby mode retains 97.7-99.7% of the original accuracy using only 1/40 to 1/10 of the visual tokens. Moreover, POINTS-Long natively supports streaming visual…
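To make the reported reduction concrete, the following minimal sketch computes how many visual tokens standby mode would keep for a given workload. Only the 1/40 and 1/10 ratios come from the abstract; the function name `standby_token_range`, the frame count, and the tokens-per-frame figure are hypothetical values chosen for illustration, not numbers from the paper.

```python
from fractions import Fraction
from typing import Tuple

# Reduction range reported in the abstract for standby mode.
STANDBY_MIN = Fraction(1, 40)
STANDBY_MAX = Fraction(1, 10)


def standby_token_range(num_frames: int, tokens_per_frame: int) -> Tuple[int, int]:
    """Return (low, high) visual token counts kept in standby mode.

    Hypothetical helper: it only applies the reported ratios to a
    full-resolution token count and does not model the paper's method.
    """
    full = num_frames * tokens_per_frame
    return int(full * STANDBY_MIN), int(full * STANDBY_MAX)


# Illustrative workload (assumed, not from the paper):
# 1,000 frames at 256 visual tokens per frame.
full_tokens = 1000 * 256
low, high = standby_token_range(1000, 256)
print(f"full: {full_tokens}, standby: {low} to {high}")
```

Under these assumed numbers, a 256,000-token visual sequence would shrink to between 6,400 and 25,600 tokens, which is the scale of saving that matters for long-video and streaming inputs.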
