Towards Visuospatial Cognition via Hierarchical Fusion of Visual Experts
Qi Feng

TL;DR
This paper introduces ViCA2, a multimodal large language model (MLLM) built specifically for visuospatial reasoning. It pairs a hierarchical fusion of two vision encoders with a dataset of over 322,000 spatial question-answer pairs, and its 7B variant reaches a state-of-the-art average score of 56.8 on VSI-Bench.
Contribution
The paper contributes ViCA2, a novel MLLM that hierarchically fuses visual experts, and ViCA-322K, a new large-scale spatial dataset for instruction tuning. Together they advance visuospatial cognition capabilities.
Findings
ViCA2-7B achieves 56.8 on VSI-Bench, outperforming larger models.
Hierarchical fusion improves spatial reasoning accuracy.
The dataset ViCA-322K enables targeted instruction tuning.
Abstract
While Multimodal Large Language Models (MLLMs) excel at general vision-language tasks, visuospatial cognition - reasoning about spatial layouts, relations, and dynamics - remains a significant challenge. Existing models often lack the necessary architectural components and specialized training data for fine-grained spatial understanding. We introduce ViCA2 (Visuospatial Cognitive Assistant 2), a novel MLLM designed to enhance spatial reasoning. ViCA2 features a dual vision encoder architecture integrating SigLIP for semantics and Hiera for spatial structure, coupled with a token ratio control mechanism for efficiency. We also developed ViCA-322K, a new large-scale dataset with over 322,000 spatially grounded question-answer pairs for targeted instruction tuning. On the challenging VSI-Bench benchmark, our ViCA2-7B model achieves a state-of-the-art average score of 56.8, significantly…
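The abstract's dual-encoder idea (semantic tokens from one encoder, spatial tokens from another, with a control on how many of each reach the language model) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `spatial_ratio` parameter, and the use of average pooling are all assumptions chosen for illustration, and tokens from both encoders are assumed to be already projected to the same dimension.

```python
from typing import List

Token = List[float]  # one visual token as a plain feature vector


def pool_tokens(tokens: List[Token], target: int) -> List[Token]:
    """Average-pool a token sequence down to at most `target` tokens."""
    if target <= 0:
        return []
    n = len(tokens)
    if n <= target:
        return list(tokens)
    pooled = []
    for i in range(target):
        # Split the sequence into `target` contiguous, non-empty groups.
        start = i * n // target
        end = (i + 1) * n // target
        group = tokens[start:end]
        dim = len(group[0])
        pooled.append([sum(t[d] for t in group) / len(group) for d in range(dim)])
    return pooled


def fuse_tokens(semantic: List[Token], spatial: List[Token],
                spatial_ratio: float) -> List[Token]:
    """Concatenate semantic tokens with a pooled set of spatial tokens.

    `spatial_ratio` is a hypothetical knob: the number of spatial tokens
    kept per semantic token. It stands in for the paper's token ratio
    control, whose actual mechanism is not described in this summary.
    """
    budget = int(len(semantic) * spatial_ratio)
    return semantic + pool_tokens(spatial, budget)


# Toy example: 4 semantic tokens, 8 spatial tokens, keep 1 spatial per 2 semantic.
semantic = [[1.0, 0.0] for _ in range(4)]
spatial = [[float(i), float(i)] for i in range(8)]
fused = fuse_tokens(semantic, spatial, spatial_ratio=0.5)
print(len(fused))
```

Lowering `spatial_ratio` shortens the sequence passed to the language model at the cost of coarser spatial detail, which is the efficiency trade-off a ratio control of this kind is meant to expose.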
Taxonomy
Topics: Multimodal Machine Learning Applications · Spatial Cognition and Navigation · Constraint Satisfaction and Optimization
