Blink: Dynamic Visual Token Resolution for Enhanced Multimodal Understanding
Yuchen Feng, Zhenyu Zhang, Naibin Gu, Yilong Chen, Peng Fu, Zheng Lin, Shuohuan Wang, Yu Sun, Hua Wu, Weiping Wang, Haifeng Wang

TL;DR
Blink is a human-inspired dynamic visual token resolution method for multimodal large language models (MLLMs). It improves perception of complex scenes by adaptively focusing on salient regions within a single forward pass.
Contribution
The paper proposes Blink, a framework with two modules, saliency-guided scanning and dynamic token resolution, that allocates more computation to salient visual tokens to improve visual perception in multimodal large language models.
Findings
A pilot analysis shows that MLLMs naturally attend to different visual regions across layers.
Selectively allocating more computation to salient visual tokens can enhance visual perception.
Experimental results show significant performance gains.
Abstract
Multimodal large language models (MLLMs) have achieved remarkable progress on various vision-language tasks, yet their visual perception remains limited. Humans, in comparison, perceive complex scenes efficiently by dynamically scanning and focusing on salient regions in a sequential "blink-like" process. Motivated by this strategy, we first investigate whether MLLMs exhibit similar behavior. Our pilot analysis reveals that MLLMs naturally attend to different visual regions across layers and that selectively allocating more computation to salient tokens can enhance visual perception. Building on this insight, we propose Blink, a dynamic visual token resolution framework that emulates the human-inspired process within a single forward pass. Specifically, Blink includes two modules: saliency-guided scanning and dynamic token resolution. It first estimates the saliency of visual tokens in…
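The saliency-guided scanning step described in the abstract can be illustrated with a minimal sketch. The abstract is truncated before it says how saliency is estimated, so everything here is an assumption rather than the authors' method: saliency is taken to be the mean attention that visual tokens receive from non-visual query positions in one layer, and the function names `visual_token_saliency` and `select_salient` are hypothetical.

```python
import numpy as np


def visual_token_saliency(attn, visual_slice):
    """Score each visual token by the attention it receives.

    attn: array of shape (heads, seq, seq) holding one layer's attention
          weights, indexed as [head, query, key].
    visual_slice: slice marking which sequence positions are visual tokens.
    Assumption (not from the paper): saliency = mean attention from
    non-visual queries to each visual key, averaged over heads.
    """
    seq = attn.shape[1]
    is_visual = np.zeros(seq, dtype=bool)
    is_visual[visual_slice] = True
    # Keep non-visual queries, then keep visual keys.
    scores = attn[:, ~is_visual][:, :, is_visual]  # (heads, n_text, n_visual)
    return scores.mean(axis=(0, 1))


def select_salient(saliency, k):
    """Return the indices of the k highest-scoring visual tokens, in order."""
    top = np.argsort(saliency)[::-1][:k]
    return np.sort(top)


# Toy example: 4 visual tokens (positions 0-3) followed by 2 text tokens.
attn = np.zeros((2, 6, 6))
attn[:, 4:, 0] = 0.3
attn[:, 4:, 1] = 0.1
attn[:, 4:, 2] = 0.5
attn[:, 4:, 3] = 0.1

saliency = visual_token_saliency(attn, slice(0, 4))
salient_idx = select_salient(saliency, k=2)
```

In this toy input, visual tokens 0 and 2 receive the most attention from the text positions, so they are the ones selected. A dynamic token resolution step would then spend more computation on those tokens; how Blink does that is not specified in the visible part of the abstract.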
Taxonomy
Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Neurobiology of Language and Bilingualism
