Audio-visual Generalized Zero-shot Learning the Easy Way

Shentong Mo; Pedro Morgado

arXiv:2407.13095 · cs.CV · July 19, 2024

TL;DR

This paper introduces EZ-AVGZL, a simple framework for audio-visual generalized zero-shot learning that aligns audio-visual embeddings with text representations using a contrastive loss and reports state-of-the-art results.

Contribution

The paper proposes a novel, straightforward approach that aligns audio-visual and textual embeddings with a contrastive loss, improving zero-shot learning performance.
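The alignment idea can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each class has one fixed text embedding, scores an audio-visual embedding against every class by cosine similarity, and uses a softmax cross-entropy as the contrastive loss. The function names, the temperature value, and the toy vectors are all hypothetical.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (assumes the vector is non-zero).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_logits(av_embedding, class_text_embeddings, temperature=0.1):
    # Temperature-scaled cosine similarity between one audio-visual
    # embedding and every class text embedding.
    av = l2_normalize(av_embedding)
    logits = []
    for text in class_text_embeddings:
        t = l2_normalize(text)
        logits.append(sum(a * b for a, b in zip(av, t)) / temperature)
    return logits

def contrastive_loss(av_batch, labels, class_text_embeddings, temperature=0.1):
    # Mean cross-entropy that pulls each audio-visual embedding toward the
    # text embedding of its own class and away from the other classes.
    total = 0.0
    for av, label in zip(av_batch, labels):
        logits = cosine_logits(av, class_text_embeddings, temperature)
        m = max(logits)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_sum_exp - logits[label]
    return total / len(av_batch)

def predict(av_embedding, class_text_embeddings):
    # Zero-shot style prediction: pick the class whose text embedding is
    # most similar, whether or not that class was seen during training.
    logits = cosine_logits(av_embedding, class_text_embeddings)
    return max(range(len(logits)), key=logits.__getitem__)

# Toy data (hypothetical): two classes in a 2-D embedding space.
text_embeddings = [[1.0, 0.0], [0.0, 1.0]]
av_batch = [[0.9, 0.1], [0.2, 0.8]]
aligned = contrastive_loss(av_batch, [0, 1], text_embeddings)
misaligned = contrastive_loss(av_batch, [1, 0], text_embeddings)
```

In this toy setup the loss is lower when the labels match the nearest text embedding than when they are swapped, which is the behavior a contrastive alignment objective is meant to reward. In a generalized zero-shot setting, `predict` would be called with text embeddings for both seen and unseen classes.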

Findings

01 Achieves state-of-the-art results on multiple benchmarks.

02 Effectively aligns audio-visual features with language representations.

03 Demonstrates the benefits of differential optimization for class separation.

Abstract

Audio-visual generalized zero-shot learning is a rapidly advancing domain that seeks to understand the intricate relations between audio and visual cues within videos. The overarching goal is to leverage insights from seen classes to identify instances from previously unseen ones. Prior approaches primarily utilized synchronized auto-encoders to reconstruct audio-visual attributes, which were informed by cross-attention transformers and projected text embeddings. However, these methods fell short of effectively capturing the intricate relationship between cross-modal features and class-label embeddings inherent in pre-trained language-aligned embeddings. To circumvent these bottlenecks, we introduce a simple yet effective framework for Easy Audio-Visual Generalized Zero-shot Learning, named EZ-AVGZL, that aligns audio-visual embeddings with transformed text representations. It utilizes…

Taxonomy

Topics: Image Processing Techniques and Applications · Domain Adaptation and Few-Shot Learning