Object Permanence Through Audio-Visual Representations
Fanjun Bu, Chien-Ming Huang

TL;DR
This paper presents a multimodal neural network that uses audio and partial visual data to predict the full trajectory of dropped objects, enabling robots to recover objects they lose from view.
Contribution
The work introduces a novel audio-visual neural network model for predicting object trajectories, improving robot error recovery in object manipulation tasks.
Findings
Predicted end locations close enough to the actual ones to fall within the visual field of the robot's wrist camera.
Enabled robots to retrieve dropped objects with minimal adjustments.
Outperformed five baseline methods in object retrieval accuracy.
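The first finding can be read as a simple geometric check: a prediction counts as useful if the object's actual resting point would be visible once the wrist camera is centred over the predicted point. The sketch below illustrates that reading only. The circular camera footprint, the `radius_m` parameter, and the function names are assumptions made for illustration; they are not the paper's evaluation protocol.

```python
import math


def within_visual_field(predicted, actual, radius_m):
    """Return True if `actual` lies inside a circle of `radius_m` centred on `predicted`.

    Assumption: the wrist camera's footprint on the floor is modelled as a
    circle, which is a simplification chosen for this sketch.
    """
    dx = actual[0] - predicted[0]
    dy = actual[1] - predicted[1]
    return math.hypot(dx, dy) <= radius_m


def success_rate(pairs, radius_m):
    """Fraction of (predicted, actual) pairs whose actual point is in view."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    hits = sum(within_visual_field(p, a, radius_m) for p, a in pairs)
    return hits / len(pairs)
```

For example, `success_rate(trials, radius_m=0.2)` would report how often a hypothetical 0.2 m footprint covers the true end location across a list of `(predicted, actual)` coordinate pairs.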
Abstract
As robots perform manipulation tasks and interact with objects, it is probable that they accidentally drop objects (e.g., due to an inadequate grasp of an unfamiliar object) that subsequently bounce out of their visual fields. To enable robots to recover from such errors, we draw upon the concept of object permanence: objects remain in existence even when they are not being sensed (e.g., seen) directly. In particular, we developed a multimodal neural network model (using a partial, observed bounce trajectory and the audio resulting from drop impact as its inputs) to predict the full bounce trajectory and the end location of a dropped object. We empirically show that: 1) our multimodal method predicted end locations close in proximity (i.e., within the visual field of the robot's wrist camera) to the actual locations and 2) the robot was able to retrieve dropped objects by applying minimal…
