Semantic Audio-Visual Navigation in Continuous Environments

Yichen Zeng; Hebaixu Wang; Meng Liu; Yu Zhou; Chen Gao; Kehan Chen; Gongping Huang

arXiv:2603.19660 · cs.CV · April 2, 2026

TL;DR

This paper introduces SAVN-CE, a realistic continuous environment for audio-visual navigation, and proposes MAGNet, a transformer-based model that improves goal reasoning and navigation success.

Contribution

The paper presents SAVN-CE, a continuous environment for semantic audio-visual navigation, and MAGNet, a multimodal transformer model that improves goal reasoning and stays robust when the target falls silent.

Findings

01 MAGNet outperforms state-of-the-art methods, improving success rate by up to 12.1%.

02 The model is robust to short-duration sounds and long-distance navigation.

03 SAVN-CE offers a more realistic setting for embodied audio-visual navigation research.

Abstract

Audio-visual navigation enables embodied agents to navigate toward sound-emitting targets by leveraging both auditory and visual cues. However, most existing approaches rely on precomputed room impulse responses (RIRs) for binaural audio rendering, restricting agents to discrete grid positions and leading to spatially discontinuous observations. To establish a more realistic setting, we introduce Semantic Audio-Visual Navigation in Continuous Environments (SAVN-CE), where agents can move freely in 3D spaces and perceive temporally and spatially coherent audio-visual streams. In this setting, targets may intermittently become silent or stop emitting sound entirely, causing agents to lose goal information. To tackle this challenge, we propose MAGNet, a multimodal transformer-based model that jointly encodes spatial and semantic goal representations and integrates historical context with…
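The RIR-based rendering that the abstract contrasts against can be illustrated with a minimal sketch: a mono source signal is convolved with a two-channel room impulse response, one channel per ear. The function name `render_binaural` and the toy impulse response are hypothetical, chosen only for illustration; they are not taken from the paper or its repository. Because such responses are precomputed for fixed source and listener positions, the agent can only hear the scene from those positions, which is the discreteness SAVN-CE aims to remove.

```python
import numpy as np

def render_binaural(source, rir):
    """Convolve a mono source with a two-channel RIR of shape (2, taps)."""
    left = np.convolve(source, rir[0])
    right = np.convolve(source, rir[1])
    return np.stack([left, right])

# Toy source: a unit impulse, so the rendered output reproduces the RIR itself.
source = np.zeros(8)
source[0] = 1.0

# Hypothetical RIR: the right ear receives the sound one sample later and quieter.
rir = np.array([[1.0, 0.0, 0.5],
                [0.0, 0.8, 0.0]])

binaural = render_binaural(source, rir)
print(binaural.shape)
```

The output has two channels and `len(source) + taps - 1` samples per channel, the standard length of a full linear convolution.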

Code & Models

Repositories

yichenzeng24/SAVN-CE (GitHub)