Audio-visual Generalized Zero-shot Learning the Easy Way
Shentong Mo, Pedro Morgado

TL;DR
This paper introduces EZ-AVGZL, a simple framework for audio-visual generalized zero-shot learning that aligns audio-visual embeddings with text representations using a contrastive loss, achieving state-of-the-art results.
Contribution
The paper proposes a novel, straightforward approach that aligns audio-visual and textual embeddings with a contrastive loss, improving zero-shot learning performance.
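As a rough illustration of what aligning audio-visual embeddings with class text embeddings through a contrastive loss can look like, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the cosine-similarity scoring, and the `temperature` value are all assumptions made for the example. Each audio-visual embedding is scored against every class text embedding, and the loss is the cross-entropy of the true class under a softmax over those scores.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contrastive_alignment_loss(av_embeddings, text_embeddings, labels, temperature=0.1):
    """Mean cross-entropy over cosine similarities (illustrative sketch).

    av_embeddings:   one audio-visual embedding per sample
    text_embeddings: one text embedding per class
    labels:          index of the true class for each sample
    temperature:     assumed softmax temperature, not taken from the paper
    """
    text = [l2_normalize(t) for t in text_embeddings]
    total = 0.0
    for av, label in zip(av_embeddings, labels):
        av = l2_normalize(av)
        logits = [dot(av, t) / temperature for t in text]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_denominator - logits[label]
    return total / len(labels)


# Toy data: two classes with orthogonal text embeddings.
toy_text = [[1.0, 0.0], [0.0, 1.0]]
toy_av = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = contrastive_alignment_loss(toy_av, toy_text, labels=[0, 1])
loss_misaligned = contrastive_alignment_loss(toy_av, toy_text, labels=[1, 0])
```

In this toy case the loss is small when each audio-visual embedding points toward its own class text embedding and large when the labels are swapped, which is the behavior a contrastive alignment objective is meant to reward.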
Findings
Achieves state-of-the-art results on multiple benchmarks.
Effectively aligns audio-visual features with language representations.
Demonstrates the benefits of differential optimization for class separation.
Abstract
Audio-visual generalized zero-shot learning is a rapidly advancing domain that seeks to understand the intricate relations between audio and visual cues within videos. The overarching goal is to leverage insights from seen classes to identify instances from previously unseen ones. Prior approaches primarily utilized synchronized auto-encoders to reconstruct audio-visual attributes, which were informed by cross-attention transformers and projected text embeddings. However, these methods fell short of effectively capturing the intricate relationship between cross-modal features and class-label embeddings inherent in pre-trained language-aligned embeddings. To circumvent these bottlenecks, we introduce a simple yet effective framework for Easy Audio-Visual Generalized Zero-shot Learning, named EZ-AVGZL, that aligns audio-visual embeddings with transformed text representations. It utilizes…
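To make the generalized zero-shot setting concrete, the sketch below classifies an audio-visual embedding by picking the class whose text embedding is most similar, with the candidate set containing both seen and unseen classes. This is a generic nearest-class-embedding illustration rather than the paper's procedure; the class names, the three-dimensional embeddings, and the helper names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def predict_class(av_embedding, class_text_embeddings):
    """Return the class name whose text embedding is closest to the input."""
    return max(
        class_text_embeddings,
        key=lambda name: cosine(av_embedding, class_text_embeddings[name]),
    )


# Hypothetical class text embeddings. In the generalized setting the
# candidate set mixes classes seen during training with unseen ones.
class_text_embeddings = {
    "dog_barking": [1.0, 0.0, 0.0],     # seen class (hypothetical)
    "playing_guitar": [0.0, 1.0, 0.0],  # seen class (hypothetical)
    "thunder": [0.0, 0.0, 1.0],         # unseen class (hypothetical)
}

prediction = predict_class([0.1, 0.2, 0.9], class_text_embeddings)
```

Because unseen classes are represented only by their text embeddings, a model can predict them only if its audio-visual embeddings land near the right text embedding, which is why the quality of the alignment matters.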
Taxonomy
Topics: Image Processing Techniques and Applications · Domain Adaptation and Few-Shot Learning
