CAVEN: An Embodied Conversational Agent for Efficient Audio-Visual Navigation in Noisy Environments

Xiulong Liu; Sudipta Paul; Moitreya Chatterjee; Anoop Cherian

arXiv:2306.04047·cs.CV·December 29, 2023·1 cites

Open Access

TL;DR

CAVEN introduces a conversational audio-visual navigation framework enabling an agent to interact with humans for improved localization of audio goals in noisy environments, significantly enhancing success rates.

Contribution

The paper presents CAVEN, a novel framework combining natural language interaction with audio-visual navigation, and introduces AVN-Instruct, a large dataset for training such interactive agents.

Findings

01 Nearly tenfold increase in success rate with the conversational approach

02 Effective in localizing new sound sources in noisy environments

03 Outperforms uni-directional interaction methods

Abstract

Audio-visual navigation of an agent towards locating an audio goal is a challenging task, especially when the audio is sporadic or the environment is noisy. In this paper, we present CAVEN, a Conversation-based Audio-Visual Embodied Navigation framework in which the agent may interact with a human/oracle for solving the task of navigating to an audio goal. Specifically, CAVEN is modeled as a budget-aware partially observable semi-Markov decision process that implicitly learns the uncertainty in the audio-based navigation policy to decide when and how the agent may interact with the oracle. Our CAVEN agent can engage in fully-bidirectional natural language conversations by producing relevant questions and interpreting free-form, potentially noisy responses from the oracle based on the audio-visual context. To enable such a capability, CAVEN is equipped with: (i) a trajectory forecasting…
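The idea of a budget-aware, uncertainty-driven decision about when to consult the oracle can be illustrated with a toy sketch. This is not the paper's method: per the abstract, CAVEN learns its uncertainty implicitly within a semi-Markov decision process, whereas the sketch below swaps in an explicit entropy threshold purely for illustration. The function names, the threshold value, and the budget of two queries are all hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_query(action_probs, budget_left, threshold=1.0):
    """Hypothetical rule: ask the oracle only when the navigation policy
    is uncertain (high entropy) and some query budget remains."""
    return budget_left > 0 and entropy(action_probs) > threshold


def run_episode(step_distributions, budget=2, threshold=1.0):
    """Return the step indices at which the toy agent would query."""
    queried = []
    for t, probs in enumerate(step_distributions):
        if should_query(probs, budget, threshold):
            queried.append(t)
            budget -= 1  # each query consumes one unit of the budget
    return queried


# Assumed example inputs: one confident step followed by three uncertain ones.
confident = [0.97, 0.01, 0.01, 0.01]
uniform = [0.25, 0.25, 0.25, 0.25]
print(run_episode([confident, uniform, uniform, uniform], budget=2))
```

With these inputs the agent skips the confident step, queries at the next two uncertain steps, and then has no budget left for the final one, which shows how a fixed budget forces the agent to ration its questions.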

Taxonomy

Topics: Multimodal Machine Learning Applications · Music and Audio Processing · Speech and dialogue systems