MCGA: A Multi-task Classical Chinese Literary Genre Audio Corpus
Yexing Du, Kaiyuan Liu, Bihe Zhang, Youcheng Pan, Bo Yang, Liangyu Huo, Xiyuan Zhang, Jian Xie, Daojing He, Yang Xiang, Ming Liu, Bing Qin

TL;DR
The paper introduces MCGA, a 119-hour audio corpus of 22,000 samples spanning Classical Chinese literary genres, designed to evaluate and advance multimodal large language models (MLLMs) on underexplored audio tasks.
Contribution
The paper presents a new multi-task audio corpus for Classical Chinese literature, evaluates ten existing MLLMs on it, and proposes domain-specific metrics for assessing MLLMs in this area.
Findings
The ten MLLMs evaluated still face substantial challenges on the MCGA test set.
The corpus supports evaluation across six tasks: ASR, S2TT, SEC, SQA, SU, and SR.
Proposed metrics help measure speech-text capability consistency.
Abstract
With the rapid advancement of Multimodal Large Language Models (MLLMs), their potential has gained significant attention in Chinese Classical Studies (CCS). While existing research primarily focuses on text and visual modalities, the audio corpus within this domain remains largely underexplored. To bridge this gap, we introduce the Multi-task Classical Chinese Literary Genre Audio Corpus (MCGA), a 119-hour corpus comprising 22,000 audio samples. It encompasses a diverse range of literary genres across six tasks: Automatic Speech Recognition (ASR), Speech-to-Text Translation (S2TT), Speech Emotion Captioning (SEC), Spoken Question Answering (SQA), Speech Understanding (SU), and Speech Reasoning (SR). Through the evaluation of ten MLLMs, our experimental results demonstrate that current MLLMs still face substantial challenges on the MCGA test set. Furthermore, we introduce a…
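The totals quoted in the abstract allow a quick sanity check on average clip length. The following is a minimal sketch, assuming the 119-hour and 22,000-sample figures are exact. The `Task` enum and the `mean_clip_seconds` helper are illustrative names, not part of any released MCGA tooling, and the abstract does not give a per-task breakdown.

```python
from enum import Enum


class Task(Enum):
    """The six MCGA tasks as named in the abstract."""
    ASR = "Automatic Speech Recognition"
    S2TT = "Speech-to-Text Translation"
    SEC = "Speech Emotion Captioning"
    SQA = "Spoken Question Answering"
    SU = "Speech Understanding"
    SR = "Speech Reasoning"


# Corpus totals as stated in the abstract (assumed exact here).
TOTAL_HOURS = 119
NUM_SAMPLES = 22_000


def mean_clip_seconds(total_hours: float, num_samples: int) -> float:
    """Average duration of one audio sample, in seconds."""
    return total_hours * 3600 / num_samples


if __name__ == "__main__":
    print(f"tasks: {len(Task)}")
    print(f"mean clip length: {mean_clip_seconds(TOTAL_HOURS, NUM_SAMPLES):.1f} s")
```

Under these assumptions the average sample runs a little under 20 seconds. Actual durations may vary considerably across tasks and genres.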
