VGGSounder: Audio-Visual Evaluations for Foundation Models
Daniil Zverev, Thaddäus Wiedemer, Ameya Prabhu, Matthias Bethge, Wieland Brendel, A. Sophia Koepke

TL;DR
VGGSounder is a re-annotated, multi-label dataset designed to improve the evaluation of audio-visual foundation models by addressing limitations of the original VGGSound dataset.
Contribution
We introduce VGGSounder, a comprehensive re-annotation of VGGSound with detailed modality labels and a new metric to analyze model limitations in multi-modal understanding.
Findings
VGGSounder enables more accurate evaluation of audio-visual models.
Analysis reveals that model performance can degrade when an additional input modality is provided.
VGGSounder addresses labeling and modality alignment issues of the original dataset.
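To make the degradation finding concrete, the sketch below counts how often a model that is correct with a single modality becomes wrong once a second modality is added. This is an illustrative assumption only: the function name `degradation_rate`, the top-1 hit criterion, and the toy labels are hypothetical and are not the paper's metric or data.

```python
from typing import Sequence, Set


def degradation_rate(
    labels: Sequence[Set[str]],
    unimodal_preds: Sequence[str],
    multimodal_preds: Sequence[str],
) -> float:
    """Fraction of samples that are correct with one modality but wrong
    when the second modality is added (hypothetical illustration).

    A prediction counts as a hit if it matches any of the sample's labels,
    which is one simple way to score top-1 output against multi-label truth.
    """
    if not (len(labels) == len(unimodal_preds) == len(multimodal_preds)):
        raise ValueError("all inputs must have the same length")
    if not labels:
        return 0.0

    degraded = sum(
        1
        for truth, uni, multi in zip(labels, unimodal_preds, multimodal_preds)
        if uni in truth and multi not in truth
    )
    return degraded / len(labels)


# Toy data, invented for this example.
labels = [
    {"dog barking"},
    {"playing guitar", "male singing"},
    {"car passing by"},
    {"rain"},
]
audio_only = ["dog barking", "male singing", "wind", "rain"]
audio_visual = ["dog barking", "playing drums", "car passing by", "thunder"]

rate = degradation_rate(labels, audio_only, audio_visual)
print(f"degraded by adding video: {rate:.2f}")
```

In this toy example, two of the four samples are solved from audio alone but missed with audio-visual input, so the rate is 0.50. Normalizing by the number of unimodal hits instead of all samples would be an equally reasonable choice.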
Abstract
The emergence of audio-visual foundation models underscores the importance of reliably assessing their multi-modal understanding. The VGGSound dataset is commonly used as a benchmark for evaluating audio-visual classification. However, our analysis identifies several limitations of VGGSound, including incomplete labelling, partially overlapping classes, and misaligned modalities. These lead to distorted evaluations of auditory and visual capabilities. To address these limitations, we introduce VGGSounder, a comprehensively re-annotated, multi-label test set that extends VGGSound and is specifically designed to evaluate audio-visual foundation models. VGGSounder features detailed modality annotations, enabling precise analyses of modality-specific performance. Furthermore, we reveal model limitations by analysing performance degradation when adding another input modality with our new…
Taxonomy
Topics: Music and Audio Processing
