UltraEval-Audio: A Unified Framework for Comprehensive Evaluation of Audio Foundation Models

Qundong Shi; Jie Zhou; Biyuan Lin; Junbo Cui; Guoyang Zeng; Yixuan Zhou; Ziyang Wang; Xin Liu; Zhen Luo; Yudong Wang; Zhiyuan Liu

arXiv:2601.01373·cs.SD·January 6, 2026

TL;DR

UltraEval-Audio is a comprehensive, unified evaluation framework for audio foundation models that supports multiple languages, tasks, and models, and introduces new benchmarks for Chinese speech assessment.

Contribution

UltraEval-Audio provides a modular, multi-language evaluation platform with real-time leaderboards, along with a new assessment scheme for audio codecs and new Chinese speech benchmarks.

Findings

01 Supports 10 languages and 14 core tasks

02 Integrates 24 models and 36 benchmarks

03 Introduces new Chinese speech benchmarks

Abstract

The development of audio foundation models has accelerated rapidly since the emergence of GPT-4o. However, the lack of comprehensive evaluation has become a critical bottleneck for further progress in the field, particularly in audio generation. Current audio evaluation faces three major challenges: (1) audio evaluation lacks a unified framework, with datasets and code scattered across various sources, hindering fair and efficient cross-model comparison; (2) audio codecs, as a key component of audio foundation models, lack a widely accepted and holistic evaluation methodology; (3) existing speech benchmarks are heavily reliant on English, making it challenging to objectively assess models' performance on Chinese. To address the first issue, we introduce UltraEval-Audio, a unified evaluation framework for audio foundation models, specifically designed for both audio understanding and…

Code & Models

Datasets

TwinkStart/speech-CMMLU
dataset · 25 dl

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing