Object Permanence Through Audio-Visual Representations

Fanjun Bu; Chien-Ming Huang

arXiv:2010.09948·cs.RO·October 5, 2021

TL;DR

This paper presents a multimodal neural network that uses audio and partial visual data to predict the full trajectory of dropped objects, enabling robots to recover objects they lose from view.

Contribution

The work introduces a novel audio-visual neural network model for predicting object trajectories, improving robot error recovery in object manipulation tasks.

Findings

01

Predicted end locations fell close enough to the actual ones to lie within the visual field of the robot's wrist camera.

02

Enabled robots to retrieve dropped objects with minimal adjustments.

03

Outperformed five baseline methods in object retrieval accuracy.

Abstract

As robots perform manipulation tasks and interact with objects, it is probable that they accidentally drop objects (e.g., due to an inadequate grasp of an unfamiliar object) that subsequently bounce out of their visual fields. To enable robots to recover from such errors, we draw upon the concept of object permanence: objects remain in existence even when they are not being sensed (e.g., seen) directly. In particular, we developed a multimodal neural network model, using a partial, observed bounce trajectory and the audio resulting from drop impact as its inputs, to predict the full bounce trajectory and the end location of a dropped object. We empirically show that: 1) our multimodal method predicted end locations close in proximity (i.e., within the visual field of the robot's wrist camera) to the actual locations and 2) the robot was able to retrieve dropped objects by applying minimal…
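The proximity criterion in the abstract (a prediction counts as close when the actual end location lies within the visual field of the wrist camera) can be illustrated with a small geometric check. This is only a sketch under stated assumptions, not the authors' evaluation code: it assumes a downward-facing camera centered above the predicted location, a conical field of view, and a flat floor. The function name `in_wrist_camera_view`, the camera height, and the field-of-view angle are hypothetical.

```python
import math


def in_wrist_camera_view(predicted_xy, actual_xy, camera_height_m, fov_deg):
    """Check whether the actual end location would be visible to a camera
    looking straight down from above the predicted end location.

    Assumptions (not from the paper): flat floor, conical field of view
    with full opening angle fov_deg, camera centered over the prediction.
    """
    # Radius of the circular floor patch covered by the view cone.
    visible_radius = camera_height_m * math.tan(math.radians(fov_deg) / 2.0)
    # Planar distance between the predicted and actual end locations.
    error = math.dist(predicted_xy, actual_xy)
    return error <= visible_radius


# Hypothetical numbers: prediction is 0.10 m off, camera 0.30 m above the floor.
visible = in_wrist_camera_view((0.50, 0.20), (0.58, 0.26),
                               camera_height_m=0.30, fov_deg=60.0)
print(visible)
```

With these illustrative values the visible patch has a radius of roughly 0.17 m, so a 0.10 m prediction error still leaves the object in view.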
