SAVVY: Spatial Awareness via Audio-Visual LLMs through Seeing and Hearing
Mingfei Chen, Zijun Cui, Xiulong Liu, Jinlin Xiang, Caleb Zheng, Jingyuan Li, Eli Shlizerman

TL;DR
This paper introduces SAVVY, a training-free reasoning pipeline, and SAVVY-Bench, a benchmark for 3D spatial reasoning in dynamic audio-visual environments, significantly improving AV-LLMs' understanding of complex spatial relationships.
Contribution
The paper presents SAVVY-Bench, the first benchmark for 3D spatial reasoning in dynamic scenes with synchronized spatial audio, and proposes SAVVY, a training-free reasoning pipeline that enhances AV-LLMs' capabilities in this domain.
Findings
SAVVY improves AV-LLMs' performance on 3D spatial reasoning tasks.
SAVVY-Bench covers thousands of spatial relationships involving static and moving objects in dynamic scenes with synchronized spatial audio.
The pipeline effectively constructs global maps from multi-modal object trajectories.
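To make the idea of a global map built from object trajectories concrete, here is a minimal sketch. It is not the authors' implementation: it assumes a simplified 2D setting in which each observation gives a known camera pose (x, y, yaw) and an egocentric azimuth and distance to an object, and it assumes static objects, whose projected positions can simply be averaged. The names `ego_to_global` and `build_global_map` are hypothetical.

```python
import math
from collections import defaultdict


def ego_to_global(cam_x, cam_y, cam_yaw, azimuth, distance):
    """Project an egocentric (azimuth, distance) observation into global x, y.

    Assumption: angles are in radians, and azimuth is measured
    counter-clockwise from the camera's forward axis.
    """
    angle = cam_yaw + azimuth
    return (cam_x + distance * math.cos(angle),
            cam_y + distance * math.sin(angle))


def build_global_map(observations):
    """Aggregate per-object observations into one global position per object.

    observations: iterable of
        (object_id, cam_x, cam_y, cam_yaw, azimuth, distance)
    Assumption: objects are static, so their projected points are averaged.
    A moving object would need a time-indexed track instead of a single mean.
    """
    points = defaultdict(list)
    for obj, cx, cy, yaw, az, dist in observations:
        points[obj].append(ego_to_global(cx, cy, yaw, az, dist))

    return {
        obj: (sum(p[0] for p in pts) / len(pts),
              sum(p[1] for p in pts) / len(pts))
        for obj, pts in points.items()
    }


# Two views of the same hypothetical object from different camera poses.
example = [
    ("lamp", 0.0, 0.0, 0.0, 0.0, 2.0),
    ("lamp", 2.0, 2.0, -math.pi / 2, 0.0, 2.0),
]
global_map = build_global_map(example)
```

Averaging is used here only because it is the simplest way to fuse repeated observations; a real system would have to handle noisy estimates, full 3D poses, and moving objects.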
Abstract
3D spatial reasoning in dynamic, audio-visual environments is a cornerstone of human cognition yet remains largely unexplored by existing Audio-Visual Large Language Models (AV-LLMs) and benchmarks, which predominantly focus on static or 2D scenes. We introduce SAVVY-Bench, the first benchmark for 3D spatial reasoning in dynamic scenes with synchronized spatial audio. SAVVY-Bench comprises thousands of relationships involving static and moving objects, and requires fine-grained temporal grounding, consistent 3D localization, and multi-modal annotation. To tackle this challenge, we propose SAVVY, a novel training-free reasoning pipeline that consists of two stages: (i) Egocentric Spatial Tracks Estimation, which leverages AV-LLMs as well as other audio-visual methods to track the trajectories of key objects related to the query using both visual and spatial audio cues, and (ii)…
Taxonomy
Topics: Music and Audio Processing · Multimodal Machine Learning Applications · Speech and Audio Processing
