DETECLAP: Enhancing Audio-Visual Representation Learning with Object Information

Shota Nakada; Taichi Nishimura; Hokuto Munakata; Masayoshi Kondo; Tatsuya Komatsu

arXiv:2409.11729 · cs.MM · September 19, 2024

Open Access

TL;DR

DETECLAP enhances audio-visual representation learning by incorporating object information through an automatic label prediction loss, leading to improved retrieval and classification performance without manual annotations.

Contribution

The paper introduces a novel method that integrates object information into audio-visual models using automatic labels, improving fine-grained recognition capabilities.

Findings

1. Improves recall@10 by +1.5% for audio-to-visual retrieval.
2. Enhances recall@10 by +1.2% for visual-to-audio retrieval.
3. Increases classification accuracy by +0.6%.
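
For readers unfamiliar with the metric, recall@k for cross-modal retrieval can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: it assumes a paired test set in which query i (for example, an audio clip) has exactly one correct candidate, candidate i (its video), and the function name `recall_at_k` and the similarity-matrix layout are hypothetical.

```python
def recall_at_k(similarity, k=10):
    """Fraction of queries whose paired candidate ranks in the top k.

    similarity[i][j] is the score between query i and candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy 3x3 example: rows are audio queries, columns are video candidates.
toy_similarity = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.5],
    [0.1, 0.2, 0.7],
]
toy_recall_at_1 = recall_at_k(toy_similarity, k=1)
```

Swapping the roles of rows and columns (transposing the matrix) gives the visual-to-audio direction under the same assumption.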

Abstract

Current audio-visual representation learning can capture rough object categories (e.g., "animals" and "instruments"), but it lacks the ability to recognize fine-grained details, such as specific categories like "dogs" and "flutes" within animals and instruments. To address this issue, we introduce DETECLAP, a method to enhance audio-visual representation learning with object information. Our key idea is to introduce an audio-visual label prediction loss to the existing Contrastive Audio-Visual Masked AutoEncoder to enhance its object awareness. To avoid costly manual annotations, we prepare object labels from both audio and visual inputs using state-of-the-art language-audio models and object detectors. We evaluate the method on audio-visual retrieval and classification using the VGGSound and AudioSet20K datasets. Our method achieves improvements in recall@10 of +1.5% and +1.2%…
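
The idea of adding a label prediction loss to an existing training objective can be sketched roughly as below. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: the labels from the audio and visual sources are merged by set union into a multi-hot target, the prediction loss is a mean binary cross-entropy, and the names `merge_labels`, `label_prediction_loss`, `total_loss`, and the `weight` parameter are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def merge_labels(audio_labels, visual_labels, vocabulary):
    """Build a multi-hot target over a fixed label vocabulary.

    Assumption: an object counts as present if either the audio-side
    or the visual-side labeler reported it (set union).
    """
    present = set(audio_labels) | set(visual_labels)
    return [1.0 if name in present else 0.0 for name in vocabulary]


def label_prediction_loss(logits, targets, eps=1e-12):
    """Mean binary cross-entropy between predicted logits and targets."""
    total = 0.0
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        total -= t * math.log(p + eps) + (1.0 - t) * math.log(1.0 - p + eps)
    return total / len(targets)


def total_loss(base_loss, logits, targets, weight=1.0):
    """Add the weighted label prediction term to an existing loss value.

    base_loss stands in for whatever the underlying model already
    optimizes; weight is a hypothetical balancing coefficient.
    """
    return base_loss + weight * label_prediction_loss(logits, targets)


# Toy usage: automatic labels from two sources, one clip.
vocabulary = ["dog", "flute", "cat"]
targets = merge_labels(["dog"], ["dog", "flute"], vocabulary)
loss = total_loss(base_loss=1.0, logits=[2.0, 0.5, -1.0], targets=targets, weight=0.5)
```

Union is only one way to combine the two label sources; intersection or per-modality targets would fit the same structure.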

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies