LLMind: Bio-inspired Training-free Adaptive Visual Representations for Vision-Language Models
Soumyaratna Debnath, Bui Duc Manh, Zinan Liu, Lin Wang

TL;DR
LLMind is a bio-inspired, training-free adaptive sampling framework for vision-language models. It improves efficiency and performance under limited pixel budgets by mimicking foveated encoding and cortical magnification in human vision.
Contribution
The paper proposes a Bio-inspired Adaptive Sampling Strategy (BASS) combined with test-time semantic feedback, enabling efficient, non-uniform visual representations without retraining existing models.
Findings
Achieves +20% on VQAv2 with limited pixels
Retains up to 97% of full-resolution performance with only 5% of the pixels
Outperforms uniform sampling baselines across multiple benchmarks
Abstract
Vision-Language Models (VLMs) typically assume uniform spatial fidelity across the entire field of view of visual inputs, dedicating equal precision even to uninformative regions. By contrast, human vision is neither uniform nor static; it is adaptive, selective, and resource-efficient. In light of this, we present the first systematic analysis of bio-inspired visual representation methods, providing insights for more efficient and adaptive VLMs. We propose LLMind (Looking Like the Mind), a novel training-free framework that mimics foveated encoding and cortical magnification in human vision to achieve adaptive, efficient representations for VLMs under tight pixel budgets. Our key idea is to explore a Bio-inspired Adaptive Sampling Strategy (BASS), enabling a Möbius-parameterized module that performs non-uniform sampling while preserving global scene structure. On top of BASS, we…
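To make the idea of Möbius-parameterized non-uniform sampling concrete, the following is a minimal sketch and not the paper's BASS module, whose exact parameterization is not given in this summary. It assumes a single complex parameter `a` with |a| < 1 and uses the standard unit-disk Möbius map w -> (w + a) / (1 + conj(a) w) to warp a uniform output grid before nearest-neighbor lookup. The function name `mobius_sample`, the scale `r`, and the nearest-neighbor choice are all hypothetical. Under this map the output center looks at source point `a`, and the local source spacing is scaled by (1 - |a|^2) / |1 + conj(a) w|^2, so the sampling density varies smoothly across the field while every output pixel still maps into the source image.

```python
import numpy as np

def mobius_sample(image, a, n, r=0.7):
    """Resample `image` on an n x n grid warped by a unit-disk Mobius map.

    Illustrative assumption only: `a` (|a| < 1) is a hypothetical complex
    parameter; r = 0.7 keeps the square grid inside the unit disk.
    """
    a = complex(a)
    if abs(a) >= 1:
        raise ValueError("|a| must be < 1")
    h, w = image.shape[:2]
    # Uniform output grid in normalized coordinates (corners have |z| < 1).
    lin = np.linspace(-r, r, n)
    u, v = np.meshgrid(lin, lin)
    out = u + 1j * v
    # Mobius map of the unit disk sending the output center 0 to source point a.
    src = (out + a) / (1 + np.conj(a) * out)
    # Normalized source coordinates -> nearest pixel indices, clipped to bounds.
    cols = np.clip(np.rint((src.real / r + 1) / 2 * (w - 1)), 0, w - 1).astype(int)
    rows = np.clip(np.rint((src.imag / r + 1) / 2 * (h - 1)), 0, h - 1).astype(int)
    return image[rows, cols]

# Usage: a 9 x 9 non-uniform sample of a synthetic 65 x 65 image.
img = np.arange(65 * 65).reshape(65, 65)
patch = mobius_sample(img, 0.3, 9)
```

With `a = 0` the map is the identity and the function reduces to plain uniform subsampling, which gives a convenient baseline for comparing against a non-uniform grid at the same pixel budget.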
