Dense 2D-3D Indoor Prediction with Sound via Aligned Cross-Modal Distillation

Heeseung Yun; Joonil Na; Gunhee Kim

arXiv:2309.11081 · cs.CV · September 21, 2023 · 1 citation

Open Access · 1 Repo

TL;DR

This paper introduces a cross-modal distillation framework that transfers knowledge from vision to audio, so that dense indoor 2D and 3D predictions can be made from sound. It achieves state-of-the-art results without relying on a specific input representation.

Contribution

It proposes the Spatial Alignment via Matching (SAM) distillation framework for aligning audio and visual features, and introduces the Dense Auditory Prediction of Surroundings (DAPS) benchmark for dense indoor scene prediction from sound.

Findings

01 Achieves state-of-the-art performance in audio-based depth estimation.

02 Is effective in semantic segmentation and 3D scene reconstruction.

03 Handles varying input shapes or dimensions without performance degradation.

Abstract

Sound can convey significant information for spatial reasoning in our daily lives. To endow deep networks with such ability, we address the challenge of dense indoor prediction with sound in both 2D and 3D via cross-modal knowledge distillation. In this work, we propose a Spatial Alignment via Matching (SAM) distillation framework that elicits local correspondence between the two modalities in vision-to-audio knowledge transfer. SAM integrates audio features with visually coherent learnable spatial embeddings to resolve inconsistencies in multiple layers of a student model. Our approach does not rely on a specific input representation, allowing for flexibility in the input shapes or dimensions without performance degradation. With a newly curated benchmark named Dense Auditory Prediction of Surroundings (DAPS), we are the first to tackle dense indoor prediction of omnidirectional…
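The alignment idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes, purely for illustration, that the learnable spatial embeddings act as queries in a softmax matching over audio features, and that the distillation signal is a mean-squared error against teacher (visual) features. The function names `align_audio_features` and `feature_distillation_loss`, the feature sizes, and the random inputs are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def align_audio_features(audio_feat, spatial_emb):
    """Match each spatial embedding against all audio tokens (assumed design).

    audio_feat:  (M, C) student features from the audio encoder
    spatial_emb: (N, C) embeddings, one per location of the visual grid
    returns:     (N, C) audio features rearranged onto the visual grid
    """
    c = audio_feat.shape[-1]
    scores = spatial_emb @ audio_feat.T / np.sqrt(c)  # (N, M) similarities
    weights = softmax(scores, axis=-1)                # each row sums to 1
    return weights @ audio_feat                       # (N, C)

def feature_distillation_loss(student_feat, teacher_feat):
    # Mean-squared error between aligned student and teacher features.
    return float(np.mean((student_feat - teacher_feat) ** 2))

rng = np.random.default_rng(0)
audio_feat = rng.normal(size=(12, 8))    # 12 audio tokens, 8 channels (hypothetical)
spatial_emb = rng.normal(size=(16, 8))   # 4x4 visual grid, flattened (hypothetical)
teacher_feat = rng.normal(size=(16, 8))  # stand-in for visual teacher features

aligned = align_audio_features(audio_feat, spatial_emb)
loss = feature_distillation_loss(aligned, teacher_feat)
print(aligned.shape, loss)
```

In this sketch the output always has one row per spatial embedding, whatever the number of audio tokens, which is one plausible way to read the claim that the approach does not depend on a specific input shape. In a real training setup the embeddings would be learned jointly with the student, and the matching would be applied at several layers.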

Peer Reviews

No public reviews on file for this paper yet.

Code & Models

Repositories

hs-yn/daps (Official)

Videos

No videos yet.

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Hearing Loss and Rehabilitation

Methods: Spatial Alignment via Matching (SAM) distillation