Visuospatial Cognitive Assistant
Qi Feng

TL;DR
This paper introduces a large dataset (ViCA-322K) and a new model (ViCA-7B) for video-based visuospatial reasoning, a capability needed in robotics and embodied AI. The model achieves state-of-the-art results on VSI-Bench, and the work emphasizes the importance of targeted data for spatial cognition tasks.
Contribution
The paper presents ViCA-322K, a dataset of 322,003 QA pairs for visuospatial reasoning, and ViCA-7B, a model fine-tuned on it that outperforms existing models, including larger ones, on all eight VSI-Bench tasks.
Findings
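ViCA-7B achieves state-of-the-art results on all eight VSI-Bench tasks (e.g., +26.1 on Absolute Distance).
ViCA-322K provides supervision for 3D metadata-grounded queries and complex video-based reasoning.
Fine-tuning ViCA-7B on ViCA-Thinking-2.68K, a dataset of explicit reasoning chains, yields ViCA-7B-Thinking, a model that articulates its spatial reasoning.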
Abstract
Video-based spatial cognition is vital for robotics and embodied AI but challenges current Vision-Language Models (VLMs). This paper makes two key contributions. First, we introduce ViCA (Visuospatial Cognitive Assistant)-322K, a diverse dataset of 322,003 QA pairs from real-world indoor videos (ARKitScenes, ScanNet, ScanNet++), offering supervision for 3D metadata-grounded queries and video-based complex reasoning. Second, we develop ViCA-7B, fine-tuned on ViCA-322K, which achieves new state-of-the-art on all eight VSI-Bench tasks, outperforming existing models, including larger ones (e.g., +26.1 on Absolute Distance). For interpretability, we present ViCA-Thinking-2.68K, a dataset with explicit reasoning chains, and fine-tune ViCA-7B to create ViCA-7B-Thinking, a model that articulates its spatial reasoning. Our work highlights the importance of targeted data and suggests paths for…
Taxonomy
Topics: Multimodal Machine Learning Applications · Constraint Satisfaction and Optimization · Robotics and Sensor-Based Localization
