Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks
Chien-yu Huang, Wei-Chih Chen, Shu-wen Yang, Andy T. Liu, Chen-An Li, Yu-Xiang Lin, Wei-Cheng Tseng, Anuj Diwan, Yi-Jen Shih, Jiatong Shi, William Chen, Chih-Kai Yang, Wenze Ren, Xuanjun Chen, Chi-Yuan Hsiao, Puyuan Peng, Shih-Heng Wang, Chun-Yi Kuan, Ke-Han Lu, Kai-Wei Chang

TL;DR
Dynamic-SUPERB Phase-2 is an extensive, evolving benchmark with 180 diverse speech and audio tasks designed to evaluate the capabilities of universal spoken language models across multiple modalities and task types.
Contribution
This paper introduces the second phase of Dynamic-SUPERB, expanding the benchmark with 125 new tasks to a total of 180 and including diverse evaluation categories such as regression and sequence generation.
Findings
No model performs well across all tasks.
SALMONN-13B excels in English ASR.
Qwen2-Audio-7B-Instruct is strong in emotion recognition.
Abstract
Multimodal foundation models, such as Gemini and ChatGPT, have revolutionized human-machine interactions by seamlessly integrating various forms of data. Developing a universal spoken language model that comprehends a wide range of natural language instructions is critical for bridging communication gaps and facilitating more intuitive interactions. However, the absence of a comprehensive evaluation benchmark poses a significant challenge. We present Dynamic-SUPERB Phase-2, an open and evolving benchmark for the comprehensive evaluation of instruction-based universal speech models. Building upon the first generation, this second version incorporates 125 new tasks contributed collaboratively by the global research community, expanding the benchmark to a total of 180 tasks, making it the largest benchmark for speech and audio evaluation. While the first generation of Dynamic-SUPERB was…
Peer Reviews
Decision: ICLR 2025 Poster
This work is comprehensive and well presented. It contributes by far the largest benchmark for speech and audio evaluation. It also conducts detailed experiments to assess the performance of several popular audio LLMs on the proposed benchmark.
There are a few issues that, if addressed, could further improve this paper.
- What is the motivation for creating this new benchmark? How will it guide the research community in advancing audio LLM research? I think the authors could emphasize this point further.
- I understand that the number of tasks has significantly expanded in Phase 2 compared to Phase 1. However, what is the primary focus of these additional tasks? Are there specific challenges unresolved in Phase 1 that Phase 2 addresses?
1. Dynamic-SUPERB Phase-2 is the largest and most comprehensive benchmark for instruction-based universal speech models. It encompasses a wide range of tasks across speech, music, and audio, all paired with natural language instructions to evaluate models' cross-modal instruction-following abilities. This is a well-motivated and forward-looking approach that aligns with future trends in universal speech models, offering strong guidance for the field.
2. The paper is clearly written and easy to f
1. The benchmark lacks tasks for audio, speech, or music generation.
2. A primary concern is that the evaluation metrics heavily rely on large language models (LLMs) as referees. The reliability of these metrics is highly dependent on the capabilities of the LLMs themselves, which affects the benchmark's robustness and comparability. This reliance may also limit the benchmark's ability to expand to more complex tasks.
* The paper tackles perhaps one of the most important problems in the audio-LLM space: the lack of audio-text benchmarks. In contrast to text-LLMs, where sets of hundreds of tasks are readily available, audio-LLM development is hindered by the lack of standard testing suites.
* I find the organization of the benchmarks to be thoroughly thought through.
* The paper runs a nice study on comparing off-the-shelf models, thus validating the proposed benchmarks.
* The paper validates referee model sel
* Using GPT-4o as a referee, I believe, jeopardizes the reproducibility of the reported results and the usefulness of the paper. Imagine that in two years someone wants to compare their model to the benchmark numbers reported in this paper. Would GPT-4o still be around? Would this particular version of it still be accessible? This is even more puzzling, since, according to Appendix E, LLaMA-3.1-70B-Instruct is extremely close to GPT-4o here. It seems to me that using an openly accessible model here woul
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · Text Readability and Simplification
