A Large-Scale Evaluation of Speech Foundation Models

Shu-wen Yang; Heng-Jui Chang; Zili Huang; Andy T. Liu; Cheng-I Lai; Haibin Wu; Jiatong Shi; Xuankai Chang; Hsiang-Sheng Tsai; Wen-Chin Huang; Tzu-hsun Feng; Po-Han Chi; Yist Y. Lin; Yung-Sung Chuang; Tzu-Hsien Huang; Wei-Cheng Tseng; Kushal Lakhotia; Shang-Wen Li; Abdelrahman Mohamed; Shinji Watanabe; Hung-yi Lee

arXiv:2404.09385·eess.AS·May 31, 2024·1 cites

TL;DR

This paper introduces SUPERB, a comprehensive benchmark for evaluating speech foundation models, demonstrating their effectiveness across multiple tasks and providing a platform for reproducible, collaborative research in speech processing.

Contribution

It establishes the SUPERB benchmark, proposes a unified multi-tasking framework for speech tasks, and develops a platform for reproducible, community-driven evaluation of speech foundation models.

Findings

01. Foundation models show promising generalizability across SUPERB tasks.

02. The multi-tasking framework is simple yet effective for speech tasks.

03. Benchmarking results are robust and statistically significant.

Abstract

The foundation model paradigm leverages a shared foundation model to achieve state-of-the-art (SOTA) performance for various tasks, requiring minimal downstream-specific modeling and data annotation. This approach has proven crucial in the field of Natural Language Processing (NLP). However, the speech processing community lacks a similar setup to explore the paradigm systematically. In this work, we establish the Speech processing Universal PERformance Benchmark (SUPERB) to study the effectiveness of the paradigm for speech. We propose a unified multi-tasking framework to address speech processing tasks in SUPERB using a frozen foundation model followed by task-specialized, lightweight prediction heads. Combining our results with community submissions, we verify that the foundation model paradigm is promising for speech, and our multi-tasking framework is simple yet effective, as the…
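The setup described in the abstract, a frozen shared model feeding lightweight task-specific heads, can be illustrated with a toy sketch. This is not the SUPERB or s3prl implementation: `FrozenUpstream`, `LinearHead`, the two toy tasks, and all dimensions and hyperparameters are hypothetical stand-ins, and a fixed random projection plays the role of the pretrained foundation model. The point it shows is that only the head parameters are updated, while the shared upstream weights stay identical across tasks.

```python
import copy
import random


class FrozenUpstream:
    """Hypothetical stand-in for a pretrained speech foundation model."""

    def __init__(self, in_dim, feat_dim, seed=0):
        rng = random.Random(seed)
        self.weights = [[rng.uniform(-1, 1) for _ in range(in_dim)]
                        for _ in range(feat_dim)]

    def extract(self, frame):
        # Fixed linear projection followed by ReLU; these weights are never updated.
        return [max(0.0, sum(w * x for w, x in zip(row, frame)))
                for row in self.weights]


class LinearHead:
    """Lightweight task-specific head: the only trainable parameters."""

    def __init__(self, feat_dim, lr=0.05):
        self.w = [0.0] * feat_dim
        self.b = 0.0
        self.lr = lr

    def predict(self, feats):
        return sum(w * f for w, f in zip(self.w, feats)) + self.b

    def step(self, feats, target):
        # One SGD step on the squared error 0.5 * (prediction - target) ** 2.
        err = self.predict(feats) - target
        self.w = [w - self.lr * err * f for w, f in zip(self.w, feats)]
        self.b -= self.lr * err


def mean_loss(upstream, head, frames, targets):
    losses = [0.5 * (head.predict(upstream.extract(f)) - t) ** 2
              for f, t in zip(frames, targets)]
    return sum(losses) / len(losses)


def train_head(upstream, head, frames, targets, epochs=50):
    for _ in range(epochs):
        for frame, target in zip(frames, targets):
            head.step(upstream.extract(frame), target)


# Toy "audio frames" and two toy tasks that share the same upstream model.
rng = random.Random(1)
frames = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(64)]
tasks = {
    "toy_regression": [sum(f) for f in frames],
    "toy_detection": [1.0 if f[0] > 0 else 0.0 for f in frames],
}

upstream = FrozenUpstream(in_dim=4, feat_dim=8)
snapshot = copy.deepcopy(upstream.weights)  # used to confirm the upstream stays frozen

results = {}
for name, targets in tasks.items():
    head = LinearHead(feat_dim=8)  # a fresh head per task
    before = mean_loss(upstream, head, frames, targets)
    train_head(upstream, head, frames, targets)
    after = mean_loss(upstream, head, frames, targets)
    results[name] = (before, after)
    print(f"{name}: loss {before:.3f} -> {after:.3f}")

print("upstream unchanged:", upstream.weights == snapshot)
```

Because the upstream is never modified, its features could be computed once and reused by every head, which is one reason this kind of setup keeps the per-task cost low.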

Code & Models

Repositories

s3prl/s3prl
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis