VGGSounder: Audio-Visual Evaluations for Foundation Models

Daniil Zverev; Thaddäus Wiedemer; Ameya Prabhu; Matthias Bethge; Wieland Brendel; A. Sophia Koepke

arXiv:2508.08237·cs.MM·October 21, 2025

TL;DR

VGGSounder is a re-annotated, multi-label dataset designed to improve the evaluation of audio-visual foundation models by addressing limitations of the original VGGSound dataset.

Contribution

We introduce VGGSounder, a comprehensive re-annotation of VGGSound with detailed modality labels and a new metric to analyze model limitations in multi-modal understanding.

Findings

01. VGGSounder enables more accurate evaluation of audio-visual models.

02. Analysis reveals that model performance can degrade when an additional input modality is supplied.

03. VGGSounder addresses the incomplete labelling and misaligned modalities of the original VGGSound dataset.

Abstract

The emergence of audio-visual foundation models underscores the importance of reliably assessing their multi-modal understanding. The VGGSound dataset is commonly used as a benchmark for evaluating audio-visual classification. However, our analysis identifies several limitations of VGGSound, including incomplete labelling, partially overlapping classes, and misaligned modalities. These lead to distorted evaluations of auditory and visual capabilities. To address these limitations, we introduce VGGSounder, a comprehensively re-annotated, multi-label test set that extends VGGSound and is specifically designed to evaluate audio-visual foundation models. VGGSounder features detailed modality annotations, enabling precise analyses of modality-specific performance. Furthermore, we reveal model limitations by analysing performance degradation when adding another input modality with our new…
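To make the idea of measuring degradation under an added modality concrete, here is a minimal Python sketch. It is not the paper's metric, whose name and definition are cut off in the abstract above. It assumes one simple reading: among clips a model classifies correctly from a single modality, count the share it gets wrong once the second modality is added. The function names, clip ids, and class labels are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Set


def is_hit(prediction: str, labels: Set[str]) -> bool:
    """Multi-label scoring: a top-1 prediction is correct if it matches any annotated label."""
    return prediction in labels


def accuracy(preds: Dict[str, str], labels: Dict[str, Set[str]]) -> float:
    """Share of clips whose top-1 prediction matches at least one label."""
    hits = sum(is_hit(pred, labels[clip]) for clip, pred in preds.items())
    return hits / len(preds)


def degradation_rate(
    unimodal_preds: Dict[str, str],
    multimodal_preds: Dict[str, str],
    labels: Dict[str, Set[str]],
) -> float:
    """Illustrative only (an assumption, not the paper's definition):
    share of clips solved with one modality that are missed with both."""
    solved = [c for c, p in unimodal_preds.items() if is_hit(p, labels[c])]
    if not solved:
        return 0.0
    lost = [c for c in solved if not is_hit(multimodal_preds[c], labels[c])]
    return len(lost) / len(solved)


# Hypothetical toy data: four clips, each with a set of valid labels.
labels = {
    "v1": {"dog barking", "people talking"},
    "v2": {"playing guitar"},
    "v3": {"car engine"},
    "v4": {"rain"},
}
audio_preds = {"v1": "people talking", "v2": "playing guitar", "v3": "wind", "v4": "rain"}
av_preds = {"v1": "dog barking", "v2": "playing piano", "v3": "car engine", "v4": "rain"}

audio_acc = accuracy(audio_preds, labels)
av_acc = accuracy(av_preds, labels)
lost_share = degradation_rate(audio_preds, av_preds, labels)
print(audio_acc, av_acc, round(lost_share, 3))
```

In this toy data both input settings reach the same accuracy, yet one of the three clips solved from audio alone is lost once video is added. Aggregate accuracy hides that kind of per-sample regression, which is why a separate degradation measure is informative.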

Taxonomy

Topics: Music and Audio Processing