Learning Self-Supervised Audio-Visual Representations for Sound Recommendations

Sudha Krishnamurthy

arXiv:2412.07406 · cs.CV · December 11, 2024

TL;DR

This paper introduces a self-supervised, attention-based method for learning audio-visual representations from unlabeled videos, improving audio-visual correlation classification and sound effect recommendation.

Contribution

It presents a novel attention-based self-supervised approach that improves audio-visual representation learning for sound recommendation tasks.

Findings

01

Improves audio-visual correlation accuracy by 18% over the baseline on VGG-Sound

02

Improves sound recommendation accuracy by 10% over the baseline on VGG-Sound

03

Further improves performance with cross-modal contrastive learning

Abstract

We propose a novel self-supervised approach for learning audio and visual representations from unlabeled videos, based on their correspondence. The approach uses an attention mechanism to learn the relative importance of convolutional features extracted at different resolutions from the audio and visual streams, and uses the resulting attention features to encode the audio and visual input based on their correspondence. We evaluated the representations learned by the model on two tasks: classifying audio-visual correlation and recommending sound effects for visual scenes. Our results show that, compared to the baseline, the representations generated by the attention model improve the correlation accuracy by 18% and the recommendation accuracy by 10% on VGG-Sound, a public video dataset. Additionally, audio-visual representations learned by training the attention model with cross-modal…
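The attention step described in the abstract can be illustrated with a minimal sketch. It is not the paper's architecture, which this summary does not specify. It assumes that features from each resolution have already been pooled to a common dimension, that importance is scored with a single linear vector (`w`, hypothetical), and that audio-visual correspondence is measured with cosine similarity. The random arrays stand in for real convolutional features.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_pool(features, w):
    """Weight per-resolution features by learned importance.

    features: (num_scales, dim) array, one pooled feature vector per resolution.
    w: (dim,) scoring vector (hypothetical; stands in for learned parameters).
    Returns the attended embedding (dim,) and the attention weights (num_scales,).
    """
    scores = features @ w       # one scalar score per resolution
    weights = softmax(scores)   # relative importance, sums to 1
    return weights @ features, weights

def cosine(a, b):
    # Cosine similarity as an assumed correspondence score.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

rng = np.random.default_rng(0)
audio_feats = rng.normal(size=(3, 8))    # placeholder: 3 resolutions, 8-dim features
visual_feats = rng.normal(size=(3, 8))
w_audio = rng.normal(size=8)
w_visual = rng.normal(size=8)

audio_emb, audio_w = attention_pool(audio_feats, w_audio)
visual_emb, visual_w = attention_pool(visual_feats, w_visual)
score = cosine(audio_emb, visual_emb)
```

In a recommendation setting, a score of this kind could be used to rank candidate sound effects for a given visual scene, with the scoring vectors trained so that matching audio-visual pairs score higher than mismatched ones.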

Taxonomy

Methods: Softmax · Attention Is All You Need · Contrastive Learning