Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching

Di Hu; Rui Qian; Minyue Jiang; Xiao Tan; Shilei Wen; Errui Ding; Weiyao Lin; Dejing Dou

arXiv:2010.05466 · cs.CV · October 13, 2020 · 68 cites

TL;DR

This paper introduces a self-supervised audiovisual framework for localizing sounding objects in complex sound scenes, leveraging learned object representations and cross-modal matching to improve accuracy.

Contribution

It presents a novel two-stage learning approach that combines self-supervised object representation learning with class-aware audiovisual matching for sound source localization.

Findings

01. Outperforms existing methods in localizing sound objects in cocktail-party scenarios.

02. Effectively filters silent objects and accurately identifies different sound classes.

03. Demonstrates robustness in both realistic and synthesized environments.

Abstract

Discriminatively localizing sounding objects in cocktail-party, i.e., mixed sound scenes, is commonplace for humans, but still challenging for machines. In this paper, we propose a two-stage learning framework to perform self-supervised class-aware sounding object localization. First, we propose to learn robust object representations by aggregating the candidate sound localization results in single-source scenes. Then, class-aware object localization maps are generated in the cocktail-party scenarios by referring to the pre-learned object knowledge, and the sounding objects are accordingly selected by matching audio and visual object category distributions, where the audiovisual consistency is viewed as the self-supervised signal. Experimental results in both realistic and synthesized cocktail-party videos demonstrate that our model is superior in filtering out silent objects and…
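The matching step in the second stage can be illustrated with a small sketch. The abstract does not state how the two category distributions are computed or compared, so everything here is an assumption made for illustration: average pooling of each class-aware map into one score, a softmax over classes, and a KL divergence as the consistency measure. The function names (`visual_category_distribution`, `matching_loss`) are hypothetical and are not taken from the official repository.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def visual_category_distribution(class_maps):
    """Turn K class-aware localization maps into a distribution over K classes.

    class_maps: list of K maps, each a list of rows of floats.
    Assumption: each map is average-pooled to a single score before softmax.
    """
    pooled = []
    for class_map in class_maps:
        cell_count = len(class_map) * len(class_map[0])
        pooled.append(sum(sum(row) for row in class_map) / cell_count)
    return softmax(pooled)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def matching_loss(class_maps, audio_logits):
    """Assumed consistency objective: KL between audio and visual class distributions."""
    p_visual = visual_category_distribution(class_maps)
    p_audio = softmax(audio_logits)
    return kl_divergence(p_audio, p_visual)


# Two classes on a 2x2 grid: only the first class has visual evidence.
maps = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]]
loss_consistent = matching_loss(maps, [1.0, 0.0])    # audio agrees with vision
loss_inconsistent = matching_loss(maps, [0.0, 1.0])  # audio favors the other class
```

Under these assumptions the loss is zero when the audio and visual distributions agree and grows as they diverge, which is the sense in which audiovisual consistency can serve as a supervision signal without labels.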

Code & Models

Repositories

DTaoo/Discriminative-Sounding-Objects-Localization
PyTorch · Official

Videos

Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching · SlidesLive

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Video Analysis and Summarization