Sound Localization by Self-Supervised Time Delay Estimation

Ziyang Chen; David F. Fouhey; Andrew Owens

arXiv:2204.12489 · cs.CV · January 31, 2023

TL;DR

This paper introduces a self-supervised approach to sound localization that estimates interaural time delays from unlabeled stereo audio. It learns correspondences between the two microphone signals with a contrastive random walk and performs on par with supervised methods on "in the wild" internet recordings.

Contribution

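The paper adapts the contrastive random walk of Jabri et al. to learn a cycle-consistent representation of stereo sound without labels. It also proposes a multimodal contrastive model that uses a visual representation of a speaker's face to estimate that speaker's time delay in a multi-speaker mixture.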
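As a rough illustration of the cycle-consistency idea behind a contrastive random walk, the sketch below walks from per-timestep features of one channel to the other and back, then penalizes walks that do not return to their starting point. This is a minimal NumPy sketch of the general technique, not the authors' implementation: the function names, the cosine-similarity affinity, and the `temperature` value are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cycle_consistency_loss(feat_a, feat_b, temperature=0.1):
    """Round-trip random-walk loss between two sets of features.

    feat_a, feat_b: arrays of shape (N, D), e.g. hypothetical embeddings of
    N time steps from the left and right channels.
    Returns (loss, round_trip), where round_trip[i, j] is the probability
    of starting at node i in A, stepping to B, and landing on node j in A.
    """
    # Cosine similarity between every pair of nodes (assumed affinity).
    a = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    b = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)
    sim = a @ b.T

    # Row-stochastic transition matrices for the A->B and B->A steps.
    step_ab = softmax(sim / temperature, axis=1)
    step_ba = softmax(sim.T / temperature, axis=1)

    # Probability of each A->B->A round trip.
    round_trip = step_ab @ step_ba

    # Cross-entropy against the identity: each walk should end where it began.
    loss = -np.mean(np.log(np.diag(round_trip) + 1e-12))
    return loss, round_trip
```

In a trainable model the features would come from a learned encoder and the loss would be minimized by gradient descent; here the features are plain arrays so that the computation can be inspected directly.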

Findings

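01. Performs on par with supervised methods on "in the wild" internet recordings.

02. The multimodal model estimates the time delay of a specific speaker in a multi-speaker mixture, given a visual representation of their face.

03. The representation is learned from unlabeled stereo sounds, reducing reliance on labeled data.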

Abstract

Sounds reach one microphone in a stereo pair sooner than the other, resulting in an interaural time delay that conveys their directions. Estimating a sound's time delay requires finding correspondences between the signals recorded by each microphone. We propose to learn these correspondences through self-supervision, drawing on recent techniques from visual tracking. We adapt the contrastive random walk of Jabri et al. to learn a cycle-consistent representation from unlabeled stereo sounds, resulting in a model that performs on par with supervised methods on "in the wild" internet recordings. We also propose a multimodal contrastive learning model that solves a visually-guided localization task: estimating the time delay for a particular person in a multi-speaker mixture, given a visual representation of their face. Project site: https://ificl.github.io/stereocrw/
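To make the notion of a time delay between two microphone signals concrete, the sketch below estimates it with plain cross-correlation, a classical hand-crafted baseline rather than the learned method proposed in the paper. The function name `estimate_delay`, the `max_delay_s` search window, and the sign convention are assumptions chosen for this example.

```python
import numpy as np


def estimate_delay(left, right, sample_rate, max_delay_s=1e-3):
    """Estimate the delay of `right` relative to `left`, in seconds.

    A positive result means the sound arrived at the left microphone first.
    Only lags within +/- max_delay_s are searched (an assumed window).
    """
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)

    # Zero-pad so the FFT computes a linear, not circular, cross-correlation.
    n = len(left) + len(right) - 1
    spec_left = np.fft.rfft(left, n=n)
    spec_right = np.fft.rfft(right, n=n)
    cc = np.fft.irfft(spec_right * np.conj(spec_left), n=n)

    # Keep lags -max_shift..+max_shift; negative lags wrap to the array end.
    max_shift = max(1, int(round(max_delay_s * sample_rate)))
    max_shift = min(max_shift, (n - 1) // 2)
    window = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))

    # The lag with the highest correlation is the delay estimate.
    lag = int(np.argmax(window)) - max_shift
    return lag / sample_rate
```

For example, if `right` is a copy of a noise signal `left` shifted later by 5 samples at 16 kHz, `estimate_delay(left, right, 16000)` returns 5/16000 seconds, and swapping the arguments flips the sign.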

Code & Models

Repositories

IFICL/stereocrw (PyTorch, official)

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis

Methods: Contrastive Learning