JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments

Zhan Liu; Changli Tang; Yuxin Wang; Zhiyuan Zhu; Youjun Chen; Yiwen Shao; Tianzi Wang; Lei Ke; Zengrui Jin; Chao Zhang

arXiv:2602.18527·cs.CV·February 24, 2026

Open Access

TL;DR

JAEGER advances audio-visual large language models by integrating 3D spatial reasoning with RGB-D and multi-channel audio, enabling improved source localization and understanding in complex physical environments.

Contribution

JAEGER extends AV-LLMs to 3D space. It contributes the neural intensity vector (Neural IV), a learned spatial audio representation, and SpatialSceneQA, a large-scale benchmark for training and evaluation.

Findings

01 Outperforms 2D-based models in spatial reasoning tasks

02 Effectively estimates direction-of-arrival in noisy environments

03 Enhances understanding of complex physical scenes

Abstract

Current audio-visual large language models (AV-LLMs) are predominantly restricted to 2D perception, relying on RGB video and monaural audio. This design choice introduces a fundamental dimensionality mismatch that precludes reliable source localization and spatial reasoning in complex 3D environments. We address this limitation by presenting JAEGER, a framework that extends AV-LLMs to 3D space, to enable joint spatial grounding and reasoning through the integration of RGB-D observations and multi-channel first-order ambisonics. A core contribution of our work is the neural intensity vector (Neural IV), a learned spatial audio representation that encodes robust directional cues to enhance direction-of-arrival estimation, even in adverse acoustic scenarios with overlapping sources. To facilitate large-scale training and systematic evaluation, we propose SpatialSceneQA, a benchmark of 61k…
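
For background on the directional cue the abstract refers to, the sketch below computes the classical (non-learned) acoustic intensity vector from a first-order ambisonics signal and reads a direction-of-arrival estimate from it. It is not the paper's Neural IV, which is a learned representation whose details are not given here. The function names `encode_foa` and `intensity_doa`, the plane-wave encoding convention (W carries the source signal, X/Y/Z carry it scaled by the direction cosines), and the 440 Hz test tone are all assumptions made for illustration.

```python
import math


def encode_foa(signal, azimuth, elevation):
    """Encode a mono signal as a single plane wave in first-order ambisonics.

    Assumed convention (illustrative): W = s, and X, Y, Z are s scaled by the
    direction cosines of the source. Angles are in radians.
    """
    cx = math.cos(azimuth) * math.cos(elevation)
    cy = math.sin(azimuth) * math.cos(elevation)
    cz = math.sin(elevation)
    w = list(signal)
    x = [s * cx for s in signal]
    y = [s * cy for s in signal]
    z = [s * cz for s in signal]
    return w, x, y, z


def intensity_doa(w, x, y, z):
    """Estimate direction of arrival from the time-averaged intensity vector.

    The intensity vector is the mean of W multiplied by each directional
    channel; for a single plane wave it points along the source direction.
    """
    n = len(w)
    ix = sum(a * b for a, b in zip(w, x)) / n
    iy = sum(a * b for a, b in zip(w, y)) / n
    iz = sum(a * b for a, b in zip(w, z)) / n
    azimuth = math.atan2(iy, ix)
    elevation = math.atan2(iz, math.hypot(ix, iy))
    return azimuth, elevation


# Hypothetical test signal: a 440 Hz tone sampled at 16 kHz for 0.1 s.
signal = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
w, x, y, z = encode_foa(signal, math.radians(60), math.radians(20))
az, el = intensity_doa(w, x, y, z)
print(math.degrees(az), math.degrees(el))
```

With a single clean source, the estimate recovers the encoded direction up to floating-point error. With overlapping sources or reverberation, the time-averaged vector becomes a mixture of directions, which is the kind of adverse scenario the abstract says motivates a learned representation.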

Taxonomy

Topics: Speech and Audio Processing · Hearing Loss and Rehabilitation · Music and Audio Processing