Visuospatial Cognitive Assistant

Qi Feng

arXiv:2505.12312·cs.CV·September 10, 2025

Open Access · 1 Repo · 2 Models · 1 Dataset

TL;DR

This paper introduces ViCA-322K, a dataset of 322,003 QA pairs from real-world indoor videos, and ViCA-7B, a model fine-tuned on it that achieves state-of-the-art results on all eight VSI-Bench tasks, underscoring the importance of targeted data for spatial cognition in robotics and embodied AI.

Contribution

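The paper presents ViCA-322K, a large dataset for visuospatial reasoning, and ViCA-7B, a model that outperforms existing models, including larger ones, on all eight VSI-Bench tasks. It also presents ViCA-Thinking-2.68K, a dataset of explicit reasoning chains, and ViCA-7B-Thinking, a model fine-tuned to articulate its spatial reasoning.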

Findings

01

ViCA-7B achieves state-of-the-art on all eight VSI-Bench tasks.

02

ViCA-322K provides supervision for 3D metadata-grounded queries and complex video-based reasoning.

03

Fine-tuning ViCA-7B on ViCA-Thinking-2.68K, a dataset of explicit reasoning chains, yields ViCA-7B-Thinking, a model that articulates its spatial reasoning.

Abstract

Video-based spatial cognition is vital for robotics and embodied AI but challenges current Vision-Language Models (VLMs). This paper makes two key contributions. First, we introduce ViCA (Visuospatial Cognitive Assistant)-322K, a diverse dataset of 322,003 QA pairs from real-world indoor videos (ARKitScenes, ScanNet, ScanNet++), offering supervision for 3D metadata-grounded queries and video-based complex reasoning. Second, we develop ViCA-7B, fine-tuned on ViCA-322K, which achieves new state-of-the-art on all eight VSI-Bench tasks, outperforming existing models, including larger ones (e.g., +26.1 on Absolute Distance). For interpretability, we present ViCA-Thinking-2.68K, a dataset with explicit reasoning chains, and fine-tune ViCA-7B to create ViCA-7B-Thinking, a model that articulates its spatial reasoning. Our work highlights the importance of targeted data and suggests paths for…

Code & Models

Repositories

nkkbr/vica
pytorch

Models

Datasets

nkkbr/ViCA-322K
dataset · 1.8k downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Constraint Satisfaction and Optimization · Robotics and Sensor-Based Localization