AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models

Yuan Tseng, Layne Berry, Yi-Ting Chen, I-Hsiang Chiu, Hsuan-Hao Lin, Max Liu, Puyuan Peng, Yi-Jen Shih, Hung-Yu Wang, Haibin Wu, Po-Yao Huang, Chun-Mao Lai, Shang-Wen Li, David Harwath, Yu Tsao, Shinji Watanabe, Abdelrahman Mohamed, Chi-Luen Feng, Hung-yi Lee

arXiv:2309.10787 · eess.AS · March 20, 2024 · 1 citation

TL;DR

The paper introduces AV-SUPERB, a comprehensive benchmark for evaluating the generalization of audio-visual representation models across multiple tasks and datasets, highlighting current models' limitations and potential improvements.

Contribution

It presents a new multi-task benchmark for audio-visual models, evaluates recent models' generalization, and suggests intermediate-task fine-tuning for better representations.

Findings

01 None of the evaluated models generalize to all tasks.

02 Intermediate-task fine-tuning improves representations.

03 AudioSet classification is a strong intermediate task.

Abstract

Audio-visual representation learning aims to develop systems with human-like perception by utilizing correlation between auditory and visual information. However, current models often focus on a limited set of tasks, and generalization abilities of learned representations are unclear. To this end, we propose the AV-SUPERB benchmark that enables general-purpose evaluation of unimodal audio/visual and bimodal fusion representations on 7 datasets covering 5 audio-visual tasks in speech and audio processing. We evaluate 5 recent self-supervised models and show that none of these models generalize to all tasks, emphasizing the need for future study on improving universal model performance. In addition, we show that representations may be improved with intermediate-task fine-tuning and audio event classification with AudioSet serves as a strong intermediate task. We release our benchmark with…

Code & Models

Repositories

roger-tseng/av-superb (PyTorch, official)

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Hearing Loss and Rehabilitation

Methods: None · Focus