Finding Fallen Objects Via Asynchronous Audio-Visual Integration

Chuang Gan; Yi Gu; Siyuan Zhou; Jeremy Schwartz; Seth Alter; James Traer; Dan Gutfreund; Joshua B. Tenenbaum; Josh McDermott; Antonio Torralba

arXiv:2207.03483·cs.CV·July 8, 2022

TL;DR

This paper introduces a new task of localizing fallen objects in 3D virtual environments by integrating asynchronous audio and visual cues, supported by a large dataset and baseline models.

Contribution

It presents the Fallen Objects dataset, generated with the physics-based ThreeDWorld simulation platform, and develops initial embodied agent baselines for multi-modal object localization.

Findings

01. Baseline models demonstrate the challenge of multi-modal asynchronous integration.
02. The dataset enables large-scale training and evaluation of audio-visual localization methods.
03. Analysis reveals key difficulties and future directions for embodied multi-modal perception.

Abstract

The way an object looks and the way it sounds provide complementary reflections of its physical properties. In many settings, cues from vision and audition arrive asynchronously but must be integrated, as when we hear an object dropped on the floor and then must find it. In this paper, we introduce a setting in which to study multi-modal object localization in 3D virtual environments. An object is dropped somewhere in a room. An embodied robot agent, equipped with a camera and microphone, must determine what object has been dropped, and where, by combining audio and visual signals with knowledge of the underlying physics. To study this problem, we have generated a large-scale dataset, the Fallen Objects dataset, that includes 8000 instances of 30 physical object categories in 64 rooms. The dataset uses the ThreeDWorld platform, which can simulate physics-based impact sounds and complex…

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Generative Adversarial Networks and Image Synthesis