A Large-Scale Evaluation of Speech Foundation Models
Shu-wen Yang, Heng-Jui Chang, Zili Huang, Andy T. Liu, Cheng-I Lai, Haibin Wu, Jiatong Shi, Xuankai Chang, Hsiang-Sheng Tsai, Wen-Chin Huang, Tzu-hsun Feng, Po-Han Chi, Yist Y. Lin, Yung-Sung Chuang, Tzu-Hsien Huang, Wei-Cheng Tseng, Kushal Lakhotia, Shang-Wen Li

TL;DR
This paper introduces SUPERB, a comprehensive benchmark for evaluating speech foundation models, demonstrating their effectiveness across multiple tasks and providing a platform for reproducible, collaborative research in speech processing.
Contribution
It establishes the SUPERB benchmark, proposes a unified multi-tasking framework for speech tasks, and develops a platform for reproducible, community-driven evaluation of speech foundation models.
Findings
Foundation models show promising generalizability across SUPERB tasks.
The multi-tasking framework is simple yet effective for speech tasks.
Benchmarking results are robust and statistically significant.
Abstract
The foundation model paradigm leverages a shared foundation model to achieve state-of-the-art (SOTA) performance for various tasks, requiring minimal downstream-specific modeling and data annotation. This approach has proven crucial in the field of Natural Language Processing (NLP). However, the speech processing community lacks a similar setup to explore the paradigm systematically. In this work, we establish the Speech processing Universal PERformance Benchmark (SUPERB) to study the effectiveness of the paradigm for speech. We propose a unified multi-tasking framework to address speech processing tasks in SUPERB using a frozen foundation model followed by task-specialized, lightweight prediction heads. Combining our results with community submissions, we verify that the foundation model paradigm is promising for speech, and our multi-tasking framework is simple yet effective, as the…
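The setup the abstract describes, one frozen foundation model shared by several lightweight task-specific heads, can be illustrated with a small sketch. This is not the SUPERB implementation: the names FrozenEncoder, LinearHead, task_a, and task_b are hypothetical, and a fixed random projection stands in for a real pretrained speech model. The sketch assumes utterance-level classification tasks and uses only the Python standard library.

```python
import math
import random


class FrozenEncoder:
    """Toy stand-in for a pretrained speech foundation model (hypothetical)."""

    def __init__(self, in_dim, feat_dim, seed=0):
        rng = random.Random(seed)
        # Tuples make the "frozen" property explicit: no in-place updates.
        self.weights = tuple(
            tuple(rng.uniform(-1.0, 1.0) for _ in range(in_dim))
            for _ in range(feat_dim)
        )

    def __call__(self, frames):
        """Map a list of frame vectors to one mean-pooled utterance feature."""
        per_frame = [
            [math.tanh(sum(w * x for w, x in zip(row, frame)))
             for row in self.weights]
            for frame in frames
        ]
        return [sum(col) / len(per_frame) for col in zip(*per_frame)]


class LinearHead:
    """Lightweight task-specific softmax classifier; the only trainable part."""

    def __init__(self, feat_dim, num_classes):
        self.W = [[0.0] * feat_dim for _ in range(num_classes)]
        self.b = [0.0] * num_classes

    def probs(self, feat):
        logits = [sum(w * f for w, f in zip(row, feat)) + b
                  for row, b in zip(self.W, self.b)]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(z - peak) for z in logits]
        total = sum(exps)
        return [e / total for e in exps]

    def sgd_step(self, feat, label, lr=0.5):
        """One cross-entropy gradient step on a single (feature, label) pair."""
        for c, p in enumerate(self.probs(feat)):
            grad = p - (1.0 if c == label else 0.0)
            self.W[c] = [w - lr * grad * f for w, f in zip(self.W[c], feat)]
            self.b[c] -= lr * grad


# One shared encoder, one small head per task (task names are illustrative).
encoder = FrozenEncoder(in_dim=4, feat_dim=8)
heads = {
    "task_a": LinearHead(8, num_classes=2),
    "task_b": LinearHead(8, num_classes=3),
}
snapshot = encoder.weights

utterance = [[0.1, 0.2, 0.3, 0.4], [0.0, -0.1, 0.2, 0.5]]
feat = encoder(utterance)  # computed once, reused by every head
for head in heads.values():
    head.sgd_step(feat, label=1)  # only head parameters change
```

Because the encoder is never updated, its features can be computed once per utterance and reused by every head, which is what keeps the per-task cost small in this sketch.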
Taxonomy
Topics: Speech Recognition and Synthesis
