Disentangled Acoustic Fields For Multimodal Physical Scene Understanding
Jie Yin, Andrew Luo, Yilun Du, Anoop Cherian, Tim K. Marks, Jonathan Le Roux, and Chuang Gan

TL;DR
This paper introduces a disentangled acoustic field model for multimodal physical scene understanding, enabling better localization of fallen objects through explicit sound generation and propagation modeling.
Contribution
It proposes a novel disentangled acoustic field model and an analysis-by-synthesis framework for improved sound-based object localization in embodied agents.
Findings
Disentangled acoustic field captures sound generation and propagation.
Spatial uncertainty maps improve object localization success.
Explicit modeling enhances generalization over direct regression methods.
Abstract
We study the problem of multimodal physical scene understanding, where an embodied agent needs to find fallen objects by inferring object properties, direction, and distance of an impact sound source. Previous works adopt feed-forward neural networks to directly regress the variables from sound, leading to poor generalization and domain adaptation issues. In this paper, we illustrate that learning a disentangled model of acoustic formation, referred to as disentangled acoustic field (DAF), to capture the sound generation and propagation process, enables the embodied agent to construct a spatial uncertainty map over where the objects may have fallen. We demonstrate that our analysis-by-synthesis framework can jointly infer sound properties by explicitly decomposing and factorizing the latent space of the disentangled model. We further show that the spatial uncertainty map can…
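The analysis-by-synthesis idea described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the paper's implementation: `toy_forward_psd` is a hypothetical hand-written stand-in for the learned disentangled acoustic field (a decaying tone whose amplitude falls off with distance), the two microphone positions, grid, sample rate, and temperature are arbitrary, and the score is a simple log-PSD mismatch. The sketch synthesizes a PSD for every candidate location, compares it with the observed PSD, and normalizes the scores into a spatial uncertainty map.

```python
import numpy as np

SAMPLE_RATE = 16000  # Hz (assumed value, not from the paper)
N_FFT = 256          # samples per analysis frame (assumed)

def psd(waveform):
    """Periodogram estimate of the power spectral density of one frame."""
    spectrum = np.fft.rfft(waveform, n=N_FFT)
    return np.abs(spectrum) ** 2 / N_FFT

def toy_forward_psd(source_xy, mic_xy, freq_hz=1000.0):
    """Hypothetical stand-in for a learned acoustic field: a decaying tone
    whose amplitude is divided by the source-microphone distance."""
    t = np.arange(N_FFT) / SAMPLE_RATE
    distance = np.linalg.norm(np.asarray(source_xy, float) - np.asarray(mic_xy, float))
    wave = np.exp(-40.0 * t) * np.sin(2 * np.pi * freq_hz * t) / max(distance, 1e-3)
    return psd(wave)

def uncertainty_map(observed_psds, mics, xs, ys, temperature=1.0):
    """Score each grid cell by how well its synthesized PSDs match the
    observed ones, then normalize the scores with a softmax."""
    scores = np.empty((len(ys), len(xs)))
    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            mismatch = 0.0
            for obs, mic in zip(observed_psds, mics):
                pred = toy_forward_psd((x, y), mic)
                mismatch += np.mean((np.log(pred + 1e-12) - np.log(obs + 1e-12)) ** 2)
            scores[i, j] = -mismatch / temperature
    probs = np.exp(scores - scores.max())  # subtract max for numerical stability
    return probs / probs.sum()

# Two microphones on the line y = 0; candidate locations lie above that line,
# so the pair of distances identifies a single cell.
mics = [(0.0, 0.0), (1.0, 0.0)]
xs = np.linspace(0.0, 4.0, 9)
ys = np.linspace(0.5, 4.0, 8)
true_source = (2.5, 1.5)

observed = [toy_forward_psd(true_source, m) for m in mics]
heatmap = uncertainty_map(observed, mics, xs, ys)
row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
best_xy = (xs[col], ys[row])
```

In this toy setting the most probable cell coincides with the true source because the observation is generated by the same forward model. With a single microphone the same procedure would produce a ring of equally likely cells, which is the kind of ambiguity a spatial uncertainty map is meant to expose to a planner.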
Peer Reviews
Decision·Submitted to ICLR 2024
The problem is interesting.
1. The proposed work lacks novelty. The main contribution of this work is the introduction of DAFs, which are applied to the generative frameworks and multitask learning to address this task. Although it shows promising performance, it lacks some degree of innovation. 2. The paper mentions that NAFs lack generalization ability for new scenarios, while DAFs can effectively solve this problem. However, no relevant comparative experiments are observed in the experimental section. 3. The paper menti
+ Transforming the sound from the waveform domain into power spectral density (PSD) representation rather than sound reconstruction is well-motivated. + The proposed disentangled acoustic fields (DAFs) are an interesting and technically sound model that explicitly disentangles sounds into several different acoustic factors. + DAFs can be used to infer the physical properties of a scene, represent uncertainty, and navigate and find fallen objects.
+ A video demo would be very helpful for us to understand the model's performance on the localization of fallen objects in the real world. + Currently, all of the experiments are conducted on synthetic datasets. It would be interesting to see how the model generalizes to real-world data. + The proposed method requires full labels to train DAFs. It would be beneficial to develop a self-supervised learning approach to avoid using many labels during model training. + Why not incorporate visual
1. The main contribution of this paper is to enhance audio perception through a so-called analysis-by-synthesis framework. Its major strength is that it keeps the power spectral density (PSD) of the generated (synthesized) audio consistent with that of the input audio. 2. The downstream multimodal planning experiments demonstrate the effectiveness of the proposed framework.
1. The phrase “physical scene understanding” in the title may overclaim the contribution, since the audio perception is limited to constrained scenarios (e.g., fallen objects). 2. Although DAFs (the predict-generate scheme) are novel in the audio modality, the idea is not especially innovative and is close to the use of cycle consistency to ensure robustness in the vision modality. I would suggest this work is better suited to ICRA than to ICLR.
Taxonomy
Topics: Speech and Audio Processing · Image Processing and 3D Reconstruction · Speech Recognition and Synthesis
