Generalizable Audio-Visual Navigation via Binaural Difference Attention and Action Transition Prediction

Jia Li; Yinfeng Yu

arXiv:2604.05007 · cs.SD · April 8, 2026

TL;DR

This paper introduces BDATP, a novel framework for audio-visual navigation that improves generalization to unseen environments by modeling interaural differences and predicting action transitions.

Contribution

The paper proposes the BDATP framework combining binaural difference attention and action transition prediction to enhance AVN generalization and performance.

Findings

1. Achieves state-of-the-art success rates on the Replica and Matterport3D datasets.

2. Improves success rate by up to 21.6 percentage points on unheard sounds.

3. Remains robust across various navigation architectures.

Abstract

In Audio-Visual Navigation (AVN), agents must locate sound sources in unseen 3D environments using visual and auditory cues. However, existing methods often struggle with generalization in unseen scenarios, as they tend to overfit to semantic sound features and specific training environments. To address these challenges, we propose the Binaural Difference Attention with Action Transition Prediction (BDATP) framework, which jointly optimizes perception and policy. Specifically, the Binaural Difference Attention (BDA) module explicitly models interaural differences to enhance spatial orientation, reducing reliance on semantic categories. Simultaneously, the Action Transition Prediction (ATP) task introduces an auxiliary action prediction objective as a regularization term, mitigating environment-specific overfitting. Extensive experiments on the Replica and…
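
To make the two ideas in the abstract concrete, the sketch below shows (a) one common interaural difference, the interaural level difference (ILD) in decibels, and (b) a main objective combined with a weighted auxiliary action-prediction loss. This is an illustration only, not the paper's BDA module or ATP task: the function names, the cross-entropy form of the auxiliary loss, and the weight of 0.1 are all assumptions introduced here.

```python
import math


def interaural_level_difference(left, right, eps=1e-12):
    """Return the ILD in dB between two equal-length audio channels.

    Positive values mean the left channel carries more energy, which
    hints that the source lies to the left. `eps` avoids log(0).
    """
    if len(left) != len(right):
        raise ValueError("channels must have the same length")
    energy_left = sum(x * x for x in left)
    energy_right = sum(x * x for x in right)
    return 10.0 * math.log10((energy_left + eps) / (energy_right + eps))


def auxiliary_action_loss(predicted_probs, taken_action, eps=1e-12):
    """Cross-entropy of a predicted action distribution (hypothetical form).

    `predicted_probs` is a probability distribution over actions and
    `taken_action` is the index of the action that was actually taken.
    """
    return -math.log(predicted_probs[taken_action] + eps)


def total_loss(policy_loss, aux_loss, aux_weight=0.1):
    """Main loss plus a weighted auxiliary term (the weight is an assumption)."""
    return policy_loss + aux_weight * aux_loss


if __name__ == "__main__":
    # Left channel is twice as loud in amplitude as the right channel.
    ild = interaural_level_difference([1.0, 1.0], [0.5, 0.5])
    aux = auxiliary_action_loss([0.25, 0.5, 0.25], taken_action=1)
    print(round(ild, 2), round(aux, 4), round(total_loss(1.0, aux), 4))
```

Because the ILD depends only on the energy ratio between the two ears, it carries directional information that does not depend on what kind of sound is playing, which is the intuition behind reducing reliance on semantic categories.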
