Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks

Chien-yu Huang; Wei-Chih Chen; Shu-wen Yang; Andy T. Liu; Chen-An Li; Yu-Xiang Lin; Wei-Cheng Tseng; Anuj Diwan; Yi-Jen Shih; Jiatong Shi; William Chen; Chih-Kai Yang; Wenze Ren; Xuanjun Chen; Chi-Yuan Hsiao; Puyuan Peng; Shih-Heng Wang; Chun-Yi Kuan; Ke-Han Lu; Kai-Wei Chang; Fabian Ritter-Gutierrez; Kuan-Po Huang; Siddhant Arora; You-Kuan Lin; Ming To Chuang; Eunjung Yeo; Kalvin Chang; Chung-Ming Chien; Kwanghee Choi; Jun-You Wang; Cheng-Hsiu Hsieh; Yi-Cheng Lin; Chee-En Yu; I-Hsiang Chiu; Heitor R. Guimarães; Jionghao Han; Tzu-Quan Lin; Tzu-Yuan Lin; Homu Chang; Ting-Wu Chang; Chun Wei Chen; Shou-Jen Chen; Yu-Hua Chen; Hsi-Chun Cheng; Kunal Dhawan; Jia-Lin Fang; Shi-Xin Fang; Kuan-Yu Fang Chiang; Chi An Fu; Hsien-Fu Hsiao; Ching Yu Hsu; Shao-Syuan Huang; Lee Chen Wei; Hsi-Che Lin; Hsuan-Hao Lin; Hsuan-Ting Lin; Jian-Ren Lin; Ting-Chun Liu; Li-Chun Lu; Tsung-Min Pai; Ankita Pasad; Shih-Yun Shan Kuan; Suwon Shon; Yuxun Tang; Yun-Shao Tsai; Jui-Chiang Wei; Tzu-Chieh Wei; Chengxi Wu; Dien-Ruei Wu; Chao-Han Huck Yang; Chieh-Chi Yang; Jia Qi Yip; Shao-Xiang Yuan; Vahid Noroozi; Zhehuai Chen; Haibin Wu; Karen Livescu; David Harwath; Shinji Watanabe; Hung-yi Lee

arXiv:2411.05361 · cs.CL · June 10, 2025

TL;DR

Dynamic-SUPERB Phase-2 is an extensive, evolving benchmark with 180 diverse speech and audio tasks designed to evaluate the capabilities of universal spoken language models across multiple modalities and task types.

Contribution

This paper introduces the second phase of Dynamic-SUPERB, expanding the benchmark with 125 new tasks and including diverse evaluation categories like regression and sequence generation.

Findings

01. No model performs well across all tasks.
02. SALMONN-13B excels in English ASR.
03. Qwen2-Audio-7B-Instruct is strong in emotion recognition.

Abstract

Multimodal foundation models, such as Gemini and ChatGPT, have revolutionized human-machine interactions by seamlessly integrating various forms of data. Developing a universal spoken language model that comprehends a wide range of natural language instructions is critical for bridging communication gaps and facilitating more intuitive interactions. However, the absence of a comprehensive evaluation benchmark poses a significant challenge. We present Dynamic-SUPERB Phase-2, an open and evolving benchmark for the comprehensive evaluation of instruction-based universal speech models. Building upon the first generation, this second version incorporates 125 new tasks contributed collaboratively by the global research community, expanding the benchmark to a total of 180 tasks, making it the largest benchmark for speech and audio evaluation. While the first generation of Dynamic-SUPERB was…

Peer Reviews

Decision · ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

This work is comprehensive and well presented. It brings the largest benchmark by far for speech and audio evaluation. It also conducts detailed experiments to assess the performance of several popular audio LLMs on the proposed benchmark.

Weaknesses

A few issues, if addressed, could further improve this paper.
- What is the motivation for creating this new benchmark? How will it guide the research community in advancing audio LLM research? I think the authors could emphasize this point further.
- I understand that the number of tasks has expanded significantly in Phase 2 compared to Phase 1. However, what is the primary focus of these additional tasks? Are there specific challenges unresolved in Phase 1 that Phase 2 addresses?

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. Dynamic-SUPERB Phase-2 is the largest and most comprehensive benchmark for instruction-based universal speech models. It encompasses a wide range of tasks across speech, music, and audio, all paired with natural language instructions to evaluate models' cross-modal instruction-following abilities. This is a well-motivated and forward-looking approach that aligns with future trends in universal speech models, offering strong guidance for the field.
2. The paper is clearly written and easy to f

Weaknesses

1. The benchmark lacks tasks for audio, speech, or music generation.
2. A primary concern is that the evaluation metrics heavily rely on large language models (LLMs) as referees. The reliability of these metrics is highly dependent on the capabilities of the LLMs themselves, which affects the benchmark's robustness and comparability. This reliance may also limit the benchmark's ability to expand to more complex tasks.

Reviewer 03 · Rating 8 · Confidence 3

Strengths

* The paper tackles perhaps one of the most important problems in the audio-LLM space: the lack of audio-text benchmarks. In contrast to text-LLMs, where sets of hundreds of tasks are readily available, audio-LLM development is hindered by the lack of standard testing suites.
* I find the organization of the benchmarks to be thoroughly thought through.
* The paper runs a nice study on comparing off-the-shelf models, thus validating the proposed benchmarks.
* The paper validates referee model sel

Weaknesses

* Using GPT-4o as a referee, I believe, jeopardizes reproducibility of the reported results and usefulness of the paper. Imagine that in two years someone wants to compare their model to the benchmark numbers reported in this paper. Would GPT-4o still be around? Would this particular version of it be still accessible? This is even more puzzling, since, according to Appendix E, LLaMA-3.1-70B-Instruct is extremely close to GPT-4o here. It seems to me that using an openly accessible model here woul

Code & Models

Repositories

dynamic-superb/dynamic-superb · Official

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning · Text Readability and Simplification