Dual Mean-Teacher: An Unbiased Semi-Supervised Framework for Audio-Visual Source Localization

Yuxin Guo; Shijie Ma; Hu Su; Zhiqing Wang; Yuhao Zhao; Wei Zou; Siyang Sun; Yun Zheng

arXiv:2403.03145 · cs.CV · March 6, 2024 · 6 cites

TL;DR

This paper introduces Dual Mean-Teacher, a semi-supervised framework for audio-visual source localization that leverages two teacher-student models to improve localization accuracy, especially with limited labeled data.

Contribution

The paper proposes a novel dual teacher-student semi-supervised framework for AVSL that effectively filters noisy samples and generates high-quality pseudo-labels, outperforming existing methods.

Findings

01. Achieved 90.4% CIoU on Flickr-SoundNet
02. Improved performance by 8.9-9.6% over self- and semi-supervised methods
03. Enhanced existing AVSL methods with the proposed framework
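
The page reports CIoU without defining it. In the AVSL literature the term usually refers to consensus IoU, with the headline number being the share of test samples whose score reaches 0.5. The following is a minimal sketch under that assumption; the function names, the flat-list layout, and the thresholds are illustrative and are not taken from the paper or its repository.

```python
def ciou(heatmap, gt_weight, tau=0.5):
    """Consensus IoU for one sample (assumed definition).

    heatmap:   flat list of predicted localization scores in [0, 1]
    gt_weight: flat list of ground-truth weights in [0, 1]
               (0 means background, 1 means all annotators agree)
    tau:       threshold that turns the heatmap into a binary prediction
    """
    pred = [h > tau for h in heatmap]
    # Ground-truth weight covered by the predicted region.
    hit = sum(g for p, g in zip(pred, gt_weight) if p)
    # Predicted pixels that fall on pure background.
    false_pos = sum(1 for p, g in zip(pred, gt_weight) if p and g == 0)
    denom = sum(gt_weight) + false_pos
    return hit / denom if denom else 0.0


def success_rate(scores, threshold=0.5):
    """Fraction of samples whose cIoU reaches the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


# Toy example: 4 pixels, two of them annotated as the sounding object.
score = ciou([0.9, 0.8, 0.6, 0.1], [1, 1, 0, 0])
rate = success_rate([score, 0.4])
```

Under this reading, a figure such as 90.4% would be a success rate over the test set rather than an average overlap.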

Abstract

Audio-Visual Source Localization (AVSL) aims to locate sounding objects within video frames given the paired audio clips. Existing methods predominantly rely on self-supervised contrastive learning of audio-visual correspondence. Without any bounding-box annotations, they struggle to achieve precise localization, especially for small objects, and suffer from blurry boundaries and false positives. Moreover, naive semi-supervised methods fail to fully exploit the abundant unlabeled data. In this paper, we propose a novel semi-supervised learning framework for AVSL, namely Dual Mean-Teacher (DMT), comprising two teacher-student structures to circumvent the confirmation bias issue. Specifically, two teachers, pre-trained on limited labeled data, are employed to filter out noisy samples via the consensus between their predictions, and then generate high-quality…
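
The two mechanisms named in the abstract, a mean-teacher update and consensus-based filtering between two teachers, can be sketched roughly as follows. This is an illustration only, not the authors' implementation: the function names, the IoU agreement test, both thresholds, and the use of a mask intersection as the pseudo-label are all assumptions, and weights are modeled as plain lists of floats rather than network parameters.

```python
def ema_update(teacher, student, decay=0.999):
    """Mean-teacher step: each teacher weight becomes an exponential
    moving average of the corresponding student weight."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def mask_iou(a, b):
    """IoU of two binary masks given as flat lists of 0/1."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 0.0


def consensus_pseudo_label(heat_a, heat_b, bin_thresh=0.5, agree_thresh=0.7):
    """Return a pseudo-label mask when two teachers agree, else None.

    heat_a, heat_b: flat confidence maps from the two teachers.
    Both thresholds are hypothetical values chosen for illustration.
    """
    mask_a = [1 if v > bin_thresh else 0 for v in heat_a]
    mask_b = [1 if v > bin_thresh else 0 for v in heat_b]
    if mask_iou(mask_a, mask_b) < agree_thresh:
        return None  # teachers disagree: treat the sample as noisy
    # Assumed labeling rule: keep only pixels both teachers mark as sounding.
    return [x & y for x, y in zip(mask_a, mask_b)]


new_teacher = ema_update([1.0, 0.0], [0.0, 1.0], decay=0.9)
agreed = consensus_pseudo_label([0.9, 0.8, 0.1, 0.2], [0.7, 0.9, 0.3, 0.1])
rejected = consensus_pseudo_label([0.9, 0.8, 0.1, 0.2], [0.1, 0.2, 0.9, 0.8])
```

Requiring agreement between two separately trained teachers is what lets the filter discard samples on which a single teacher's errors would otherwise be reinforced in its student.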

Code & Models

Repositories

gyx-gloria/dmt (PyTorch, official)

Videos

Dual Mean-Teacher: An Unbiased Semi-Supervised Framework for Audio-Visual Source Localization · SlidesLive

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Digital Media Forensic Detection

Methods: Contrastive Learning