SAVVY: Spatial Awareness via Audio-Visual LLMs through Seeing and Hearing

Mingfei Chen; Zijun Cui; Xiulong Liu; Jinlin Xiang; Caleb Zheng; Jingyuan Li; Eli Shlizerman

arXiv:2506.05414·cs.CV·June 9, 2025

TL;DR

This paper introduces SAVVY-Bench, a benchmark for 3D spatial reasoning in dynamic audio-visual environments, and SAVVY, a training-free reasoning pipeline that improves AV-LLMs' understanding of complex spatial relationships.

Contribution

The paper presents SAVVY-Bench, the first benchmark for 3D spatial reasoning in dynamic scenes with synchronized spatial audio, and proposes SAVVY, a training-free reasoning pipeline that enhances AV-LLMs' capabilities in this domain.

Findings

01. SAVVY improves AV-LLMs' performance on 3D spatial reasoning tasks.

02. SAVVY-Bench provides a comprehensive dataset for dynamic spatial relationships.

03. The pipeline effectively constructs global maps from multi-modal object trajectories.
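
As a rough illustration of the third finding, the sketch below shows one way egocentric observations of an object could be placed in a shared global frame. It is not the paper's implementation: the 2D pose format (position plus yaw), the function names `ego_to_global` and `build_global_map`, and the averaging step are all assumptions made for this example.

```python
import math
from statistics import fmean


def ego_to_global(cam_x, cam_y, cam_yaw, ego_forward, ego_left):
    """Map an egocentric offset (forward, left) into global (x, y).

    Assumed convention: at yaw 0 the camera faces +x and "left" is +y;
    yaw is in radians, counter-clockwise.
    """
    gx = cam_x + ego_forward * math.cos(cam_yaw) - ego_left * math.sin(cam_yaw)
    gy = cam_y + ego_forward * math.sin(cam_yaw) + ego_left * math.cos(cam_yaw)
    return gx, gy


def build_global_map(tracks):
    """Average each object's global positions over all of its observations.

    tracks: dict mapping an object name to a list of
            (cam_x, cam_y, cam_yaw, ego_forward, ego_left) tuples.
    Averaging is only meaningful for static objects (an assumption here);
    a moving object would need its full trajectory kept instead.
    """
    global_map = {}
    for name, observations in tracks.items():
        points = [ego_to_global(*obs) for obs in observations]
        xs = [p[0] for p in points]
        ys = [p[1] for p in points]
        global_map[name] = (fmean(xs), fmean(ys))
    return global_map


# Hypothetical data: the same object seen from two camera poses.
example_tracks = {
    "speaker": [
        (0.0, 0.0, 0.0, 2.0, 0.0),           # at origin, facing +x
        (2.0, 2.0, -math.pi / 2, 2.0, 0.0),  # at (2, 2), facing -y
    ],
}
example_map = build_global_map(example_tracks)
```

Both hypothetical observations resolve to the same global point, which is what lets per-frame egocentric estimates be merged into a single map entry.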

Abstract

3D spatial reasoning in dynamic, audio-visual environments is a cornerstone of human cognition yet remains largely unexplored by existing Audio-Visual Large Language Models (AV-LLMs) and benchmarks, which predominantly focus on static or 2D scenes. We introduce SAVVY-Bench, the first benchmark for 3D spatial reasoning in dynamic scenes with synchronized spatial audio. SAVVY-Bench comprises thousands of relationships involving static and moving objects, and requires fine-grained temporal grounding, consistent 3D localization, and multi-modal annotation. To tackle this challenge, we propose SAVVY, a novel training-free reasoning pipeline that consists of two stages: (i) Egocentric Spatial Tracks Estimation, which leverages AV-LLMs as well as other audio-visual methods to track the trajectories of key objects related to the query using both visual and spatial audio cues, and (ii)…

Code & Models

Datasets

uwneuroai/SAVVY-Bench
dataset · 72 dl

Videos

SAVVY: Spatial Awareness via Audio-Visual LLMs through Seeing and Hearing · slideslive

Taxonomy

Topics: Music and Audio Processing · Multimodal Machine Learning Applications · Speech and Audio Processing
