Dual Mean-Teacher: An Unbiased Semi-Supervised Framework for Audio-Visual Source Localization
Yuxin Guo, Shijie Ma, Hu Su, Zhiqing Wang, Yuhao Zhao, Wei Zou, Siyang Sun, Yun Zheng

TL;DR
This paper introduces Dual Mean-Teacher, a semi-supervised framework for audio-visual source localization that leverages two teacher-student models to improve localization accuracy, especially with limited labeled data.
Contribution
The paper proposes a novel dual teacher-student semi-supervised framework for audio-visual source localization (AVSL) that filters noisy samples and generates high-quality pseudo-labels, outperforming existing methods.
Findings
Achieved 90.4% CIoU on Flickr-SoundNet
Improved performance by 8.9-9.6% over self- and semi-supervised methods
Enhanced existing AVSL methods with the proposed framework
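The findings above are reported in CIoU, which this page does not define. The following is a minimal sketch, assuming the common convention of reporting the share of test samples whose predicted region overlaps the ground truth with an IoU at or above 0.5. The function names, the binary-mask inputs, and the 0.5 threshold are illustrative assumptions, not the authors' evaluation code.

```python
import numpy as np

def region_iou(pred_mask, gt_mask):
    """Intersection over union of two boolean masks of equal shape."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def success_rate(pred_masks, gt_masks, thresh=0.5):
    """Fraction of samples whose IoU reaches the threshold (assumed 0.5)."""
    scores = [region_iou(p, g) for p, g in zip(pred_masks, gt_masks)]
    return sum(s >= thresh for s in scores) / len(scores)

# Toy data: one perfect prediction and one complete miss.
gt = np.zeros((4, 4), dtype=bool)
gt[:2, :2] = True
hit = gt.copy()
miss = np.zeros((4, 4), dtype=bool)
miss[2:, 2:] = True
rate = success_rate([hit, miss], [gt, gt])
```

Published benchmarks may build the ground-truth map from several annotators rather than a single binary mask, so this sketch only illustrates the thresholded-overlap idea.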
Abstract
Audio-Visual Source Localization (AVSL) aims to locate sounding objects within video frames given the paired audio clips. Existing methods predominantly rely on self-supervised contrastive learning of audio-visual correspondence. Without any bounding-box annotations, they struggle to achieve precise localization, especially for small objects, and suffer from blurry boundaries and false positives. Moreover, the naive semi-supervised method is poor in fully leveraging the information of abundant unlabeled data. In this paper, we propose a novel semi-supervised learning framework for AVSL, namely Dual Mean-Teacher (DMT), comprising two teacher-student structures to circumvent the confirmation bias issue. Specifically, two teachers, pre-trained on limited labeled data, are employed to filter out noisy samples via the consensus between their predictions, and then generate high-quality…
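The two ideas named in the abstract, a mean-teacher update and consensus-based filtering between two teachers, can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names, the decay value, the binarization and agreement thresholds, and the choice to use the intersection of the two masks as the pseudo-label are all hypothetical.

```python
import numpy as np

def ema_update(teacher, student, decay=0.999):
    """Mean-teacher step: each teacher weight tracks an exponential
    moving average of the matching student weight (decay is assumed)."""
    return {k: decay * teacher[k] + (1.0 - decay) * student[k] for k in teacher}

def mask_iou(a, b):
    """Intersection over union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def consensus_pseudo_label(map_a, map_b, bin_thresh=0.5, agree_thresh=0.5):
    """Keep an unlabeled sample only if the two teachers agree.

    map_a and map_b are localization maps in [0, 1] from the two teachers.
    Returns None when agreement is too low (sample treated as noisy),
    otherwise a boolean pseudo-label mask (here, the intersection).
    """
    mask_a = map_a >= bin_thresh
    mask_b = map_b >= bin_thresh
    if mask_iou(mask_a, mask_b) < agree_thresh:
        return None
    return np.logical_and(mask_a, mask_b)
```

In a training loop under these assumptions, samples for which `consensus_pseudo_label` returns `None` would be skipped, and `ema_update` would be applied to each teacher after its student takes a gradient step.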
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Digital Media Forensic Detection
Methods: Contrastive Learning
