JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments
Zhan Liu, Changli Tang, Yuxin Wang, Zhiyuan Zhu, Youjun Chen, Yiwen Shao, Tianzi Wang, Lei Ke, Zengrui Jin, Chao Zhang

TL;DR
JAEGER extends audio-visual large language models (AV-LLMs) from 2D to 3D perception by combining RGB-D observations with multi-channel first-order ambisonic audio, enabling joint source localization and spatial reasoning in complex physical environments.
Contribution
The paper introduces a 3D extension of AV-LLMs, the neural intensity vector (Neural IV), a learned spatial audio representation for direction-of-arrival estimation, and SpatialSceneQA, a large-scale benchmark for training and evaluation.
Findings
Outperforms 2D-based models in spatial reasoning tasks
Effectively estimates direction-of-arrival in noisy environments
Enhances understanding of complex physical scenes
Abstract
Current audio-visual large language models (AV-LLMs) are predominantly restricted to 2D perception, relying on RGB video and monaural audio. This design choice introduces a fundamental dimensionality mismatch that precludes reliable source localization and spatial reasoning in complex 3D environments. We address this limitation by presenting JAEGER, a framework that extends AV-LLMs to 3D space, to enable joint spatial grounding and reasoning through the integration of RGB-D observations and multi-channel first-order ambisonics. A core contribution of our work is the neural intensity vector (Neural IV), a learned spatial audio representation that encodes robust directional cues to enhance direction-of-arrival estimation, even in adverse acoustic scenarios with overlapping sources. To facilitate large-scale training and systematic evaluation, we propose SpatialSceneQA, a benchmark of 61k…
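For context on the directional cues mentioned in the abstract, the sketch below shows the classical (non-learned) intensity vector computed from first-order ambisonics, which is the kind of hand-crafted feature that a learned representation such as Neural IV would be expected to improve upon. It is not the paper's method. The function names `encode_foa` and `intensity_doa`, the idealized plane-wave encoding with unit gain on W, and the time-domain broadband averaging are all assumptions made for illustration.

```python
import numpy as np

def encode_foa(signal, azimuth, elevation):
    """Encode a mono signal as an idealized plane wave in FOA (W, X, Y, Z).

    Assumption: unit gain on W; real formats (e.g. SN3D, FuMa) scale channels differently.
    Angles are in radians.
    """
    w = signal
    x = signal * np.cos(azimuth) * np.cos(elevation)
    y = signal * np.sin(azimuth) * np.cos(elevation)
    z = signal * np.sin(elevation)
    return np.stack([w, x, y, z])

def intensity_doa(foa):
    """Estimate (azimuth, elevation) in radians from a (4, n_samples) FOA array.

    Uses the time-averaged product of the omni channel W with the dipole
    channels X, Y, Z (a broadband pseudo-intensity vector).
    """
    w, xyz = foa[0], foa[1:]
    ix, iy, iz = (w * xyz).mean(axis=1)
    azimuth = np.arctan2(iy, ix)
    elevation = np.arctan2(iz, np.hypot(ix, iy))
    return azimuth, elevation

# Hypothetical test scene: one white-noise source plus weak sensor noise.
rng = np.random.default_rng(0)
source = rng.standard_normal(16000)
foa = encode_foa(source, np.deg2rad(40.0), np.deg2rad(-10.0))
foa_noisy = foa + 0.1 * rng.standard_normal(foa.shape)

az, el = intensity_doa(foa_noisy)
print(f"azimuth={np.rad2deg(az):.1f} deg, elevation={np.rad2deg(el):.1f} deg")
```

With a single source the averaged vector points toward it. With overlapping sources the average is pulled toward an energy-weighted mixture of directions, which is the adverse case the abstract says Neural IV is designed to handle.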
Taxonomy
Topics: Speech and Audio Processing · Hearing Loss and Rehabilitation · Music and Audio Processing
