OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia
Xuelong Geng, Kun Wei, Qijie Shao, Shuiyun Liu, Zhennan Lin, Zhixian Zhao, Guojian Li, Wenjie Tian, Peikun Chen, Yangze Li, Pengcheng Guo, Mingchen Shao, Shuiyuan Wang, Yuang Cao, Chengyou Wang, Tianyi Xu, Yuhang Dai, Xinfa Zhu, Yue Li, Li Zhang, Lei Xie

TL;DR
OSUM is an open, resource-efficient speech understanding model that integrates multiple speech tasks, promoting transparency and accessibility for academic research in speech AI.
Contribution
We introduce OSUM, a multi-task speech understanding model designed for academic settings with limited resources, emphasizing transparency and practical training strategies.
Findings
OSUM achieves competitive performance across various speech tasks.
The model demonstrates stable multi-task training with an ASR+X strategy.
Open data and methodology facilitate academic research and innovation.
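As a rough illustration of the ASR+X idea mentioned above, the sketch below builds a single training target that pairs a transcript with the label of a secondary task X. It assumes ASR+X means the model is trained to emit the transcript together with the X label; the `Sample` class, the `build_asr_x_target` helper, and the `<TASK:LABEL>` tag format are all hypothetical and are not taken from the OSUM release.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One hypothetical training example for an ASR+X setup."""
    transcript: str  # reference text for the ASR part
    task: str        # secondary task name, e.g. "SER" or "VED"
    label: str       # label for the secondary task, e.g. "happy"


def build_asr_x_target(sample: Sample) -> str:
    """Return the transcript followed by a task-label tag.

    The tag format "<TASK:LABEL>" is an assumption made for this sketch.
    """
    tag = f"<{sample.task}:{sample.label.upper()}>"
    return f"{sample.transcript.strip()} {tag}"


if __name__ == "__main__":
    example = Sample(transcript=" hello world ", task="SER", label="happy")
    print(build_asr_x_target(example))
```

Under this assumption, every non-ASR example still supervises the transcript, which is one plausible reason such a strategy could keep multi-task training stable.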
Abstract
Large Language Models (LLMs) have made significant progress in various downstream tasks, inspiring the development of Speech Understanding Language Models (SULMs) to enable comprehensive speech-based interactions. However, most advanced SULMs are developed by the industry, leveraging large-scale datasets and computational resources that are not readily available to the academic community. Moreover, the lack of transparency in training details creates additional barriers to further innovation. In this study, we present OSUM, an Open Speech Understanding Model designed to explore the potential of training SULMs under constrained academic resources. The OSUM model combines a Whisper encoder with a Qwen2 LLM and supports a wide range of speech tasks, including speech recognition (ASR), speech recognition with timestamps (SRWT), vocal event detection (VED), speech emotion recognition (SER),…
Taxonomy
Topics: Natural Language Processing Techniques
