DETECLAP: Enhancing Audio-Visual Representation Learning with Object Information
Shota Nakada, Taichi Nishimura, Hokuto Munakata, Masayoshi Kondo, Tatsuya Komatsu

TL;DR
DETECLAP enhances audio-visual representation learning by incorporating object information through an automatic label prediction loss, leading to improved retrieval and classification performance without manual annotations.
Contribution
The paper introduces a novel method that integrates object information into audio-visual models using automatic labels, improving fine-grained recognition capabilities.
Findings
Improves recall@10 by +1.5% for audio-to-visual retrieval.
Enhances recall@10 by +1.2% for visual-to-audio retrieval.
Increases classification accuracy by +0.6%.
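For readers unfamiliar with the retrieval metric in these findings, recall@K can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: it assumes paired audio and visual embeddings share the same index, ranks by cosine similarity, and uses made-up toy vectors. The function names `cosine` and `recall_at_k` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def recall_at_k(queries, gallery, k):
    """Fraction of queries whose paired item (same index) ranks in the top k.

    Assumption: queries[i] and gallery[i] come from the same clip.
    """
    hits = 0
    for i, q in enumerate(queries):
        sims = [cosine(q, g) for g in gallery]
        # Gallery indices sorted from most to least similar.
        ranked = sorted(range(len(gallery)), key=lambda j: sims[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(queries)

# Toy embeddings (illustrative only, not from the paper).
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]
video = [[1.0, 0.2], [0.1, 1.0], [1.0, 0.5]]

print(recall_at_k(audio, video, 1))  # audio-to-visual
print(recall_at_k(video, audio, 1))  # visual-to-audio: swap the roles
```

Swapping the query and gallery arguments gives the other retrieval direction, which is why the two directions are reported separately.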
Abstract
Current audio-visual representation learning can capture rough object categories (e.g., "animals" and "instruments"), but it lacks the ability to recognize fine-grained details, such as the specific categories "dogs" and "flutes" within animals and instruments. To address this issue, we introduce DETECLAP, a method to enhance audio-visual representation learning with object information. Our key idea is to introduce an audio-visual label prediction loss to the existing Contrastive Audio-Visual Masked AutoEncoder to enhance its object awareness. To avoid costly manual annotations, we prepare object labels from both audio and visual inputs using state-of-the-art language-audio models and object detectors. We evaluate the method on audio-visual retrieval and classification using the VGGSound and AudioSet20K datasets. Our method achieves improvements in recall@10 of +1.5% and +1.2%…
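The idea of adding a label prediction loss to an existing objective can be sketched as below. This is a rough illustration under stated assumptions, not the authors' implementation: the label prediction term is assumed to be a multi-label binary cross-entropy, the audio and visual labels are assumed to be merged by union, and `base_loss` is a placeholder for whatever the Contrastive Audio-Visual Masked AutoEncoder already optimizes. The names `multilabel_bce`, `merge_labels`, `total_loss`, and the `weight` parameter are all hypothetical.

```python
import math

def multilabel_bce(logits, targets):
    """Mean binary cross-entropy over object classes, computed from logits.

    Uses the numerically stable form max(z, 0) - z*y + log(1 + exp(-|z|)).
    """
    total = 0.0
    for z, y in zip(logits, targets):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

def merge_labels(audio_labels, visual_labels):
    """Union of per-modality automatic labels (assumed merge rule)."""
    return [max(a, v) for a, v in zip(audio_labels, visual_labels)]

def total_loss(base_loss, logits, targets, weight=1.0):
    """Base objective plus a weighted label prediction term.

    base_loss stands in for the existing model's loss; weight is a
    hypothetical balancing coefficient.
    """
    return base_loss + weight * multilabel_bce(logits, targets)

# Toy example with three object classes, e.g. dog, flute, car (illustrative).
audio_labels = [1, 0, 0]   # as if produced by a language-audio model
visual_labels = [0, 0, 1]  # as if produced by an object detector
targets = merge_labels(audio_labels, visual_labels)

logits = [2.0, -1.5, 0.3]  # hypothetical prediction-head outputs
print(total_loss(base_loss=1.0, logits=logits, targets=targets, weight=0.5))
```

Because the targets come from pretrained models rather than annotators, the extra term adds supervision without manual labeling cost, which is the motivation the abstract gives.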
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies
