Audio-Visual Grounding Referring Expression for Robotic Manipulation

Yefei Wang; Kaili Wang; Yi Wang; Di Guo; Huaping Liu; Fuchun Sun

arXiv:2109.10571·cs.RO·September 23, 2021·1 cites

Open Access

TL;DR

This paper introduces an audio-visual grounding task for robotic manipulation, in which the robot combines sound and visual cues to understand and execute instructions. It also contributes a new dataset and experimental validation.

Contribution

The paper proposes a novel audio-visual framework for grounding referring expressions in robotic manipulation, establishes a dataset for evaluation, and reports offline and online experiments.

Findings

01

Robots perform better with combined audio-visual data than visual data alone.

02

The proposed framework effectively localizes targets and recognizes sounds in manipulation tasks.

03

Extensive offline and online experiments validate the approach's effectiveness.

Abstract

Referring expressions are commonly used in everyday dialogue to refer to a specific target. In this paper, we develop a novel task of audio-visual grounding referring expression for robotic manipulation. The robot leverages both audio and visual information to understand the referring expression in a given manipulation instruction and then carries out the corresponding manipulation. To solve this task, we propose an audio-visual framework for visual localization and sound recognition. We have also established a dataset containing visual data, auditory data, and manipulation instructions for evaluation. Finally, extensive offline and online experiments verify the effectiveness of the proposed framework and demonstrate that the robot performs better with audio-visual data than with visual data alone.

Taxonomy

Topics: Multimodal Machine Learning Applications · Subtitles and Audiovisual Media · Speech and dialogue systems