TL;DR
This paper introduces SAVN-CE, a realistic continuous environment for audio-visual navigation, and proposes MAGNet, a transformer-based model that improves goal reasoning and navigation success.
Contribution
The paper contributes a new continuous environment for audio-visual navigation (SAVN-CE) and a multimodal transformer model (MAGNet) that enhances goal reasoning and robustness in complex scenarios.
Findings
MAGNet outperforms state-of-the-art methods, improving success rate by up to 12.1%.
The model is robust to short-duration sounds and long-distance navigation.
SAVN-CE offers a more realistic setting for embodied audio-visual navigation research.
Abstract
Audio-visual navigation enables embodied agents to navigate toward sound-emitting targets by leveraging both auditory and visual cues. However, most existing approaches rely on precomputed room impulse responses (RIRs) for binaural audio rendering, restricting agents to discrete grid positions and leading to spatially discontinuous observations. To establish a more realistic setting, we introduce Semantic Audio-Visual Navigation in Continuous Environments (SAVN-CE), where agents can move freely in 3D spaces and perceive temporally and spatially coherent audio-visual streams. In this setting, targets may intermittently become silent or stop emitting sound entirely, causing agents to lose goal information. To tackle this challenge, we propose MAGNet, a multimodal transformer-based model that jointly encodes spatial and semantic goal representations and integrates historical context with…
