TL;DR
This paper introduces AV-Map, a multi-modal framework that combines audio and visual data to rapidly reconstruct detailed floorplans from limited viewpoints, outperforming existing visual-only methods.
Contribution
The paper presents AV-Map, a novel multi-modal encoder-decoder model that reasons jointly about audio and visual cues to reconstruct a floorplan from a short input video sequence.
Findings
Reconstructs the entire floorplan with 66% accuracy from input that covers only 26% of the area.
Outperforms state-of-the-art visual-only mapping methods.
Effectively infers room types and unseen spaces using combined audio-visual data.
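To make the relationship between the coverage and accuracy figures concrete, here is a toy calculation. It is only a sketch: it assumes the floorplan is stored as a binary grid and scores it cell by cell. The grids, the function names `coverage` and `accuracy`, and the metric definitions are illustrative assumptions, not the paper's evaluation protocol.

```python
# Toy illustration only: the grid encoding and both metrics are assumptions,
# not the evaluation code or data used for AV-Map.

def coverage(observed_mask):
    """Fraction of grid cells that the input glimpses actually observed."""
    cells = [c for row in observed_mask for c in row]
    return sum(cells) / len(cells)


def accuracy(predicted, ground_truth):
    """Fraction of grid cells whose predicted label matches the ground truth."""
    pairs = [
        (p, g)
        for pred_row, gt_row in zip(predicted, ground_truth)
        for p, g in zip(pred_row, gt_row)
    ]
    return sum(p == g for p, g in pairs) / len(pairs)


# 1 = interior floor space, 0 = not interior (hypothetical 4x4 map)
ground_truth = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 1],
]

# 1 = cell was seen in the input glimpses, 0 = never seen
observed_mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

# Hypothetical model output covering the whole map, including unseen cells
predicted = [
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
]

print(coverage(observed_mask))            # 0.25
print(accuracy(predicted, ground_truth))  # 0.75
```

In this toy map the glimpses cover a quarter of the cells, while accuracy is scored over every cell, so most of the score depends on how well the unseen region is inferred.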
Abstract
Given only a few glimpses of an environment, how much can we infer about its entire floorplan? Existing methods can map only what is visible or immediately apparent from context, and thus require substantial movements through a space to fully map it. We explore how both audio and visual sensing together can provide rapid floorplan reconstruction from limited viewpoints. Audio not only helps sense geometry outside the camera's field of view, but it also reveals the existence of distant freespace (e.g., a dog barking in another room) and suggests the presence of rooms not visible to the camera (e.g., a dishwasher humming in what must be the kitchen to the left). We introduce AV-Map, a novel multi-modal encoder-decoder framework that reasons jointly about audio and vision to reconstruct a floorplan from a short input video sequence. We train our model to predict both the interior structure…
