OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia

Xuelong Geng; Kun Wei; Qijie Shao; Shuiyun Liu; Zhennan Lin; Zhixian Zhao; Guojian Li; Wenjie Tian; Peikun Chen; Yangze Li; Pengcheng Guo; Mingchen Shao; Shuiyuan Wang; Yuang Cao; Chengyou Wang; Tianyi Xu; Yuhang Dai; Xinfa Zhu; Yue Li; Li Zhang; Lei Xie

arXiv:2501.13306·cs.SD·February 18, 2025

TL;DR

OSUM is an open, resource-efficient speech understanding model that integrates multiple speech tasks, promoting transparency and accessibility for academic research in speech AI.

Contribution

We introduce OSUM, a multi-task speech understanding model designed for academic settings with limited resources, emphasizing transparency and practical training strategies.

Findings

01 OSUM achieves competitive performance across various speech tasks.

02 The model demonstrates stable multi-task training with an ASR+X strategy.

03 Open data and methodology facilitate academic research and innovation.

Abstract

Large Language Models (LLMs) have made significant progress in various downstream tasks, inspiring the development of Speech Understanding Language Models (SULMs) to enable comprehensive speech-based interactions. However, most advanced SULMs are developed by industry, leveraging large-scale datasets and computational resources that are not readily available to the academic community. Moreover, the lack of transparency in training details creates additional barriers to further innovation. In this study, we present OSUM, an Open Speech Understanding Model designed to explore the potential of training SULMs under constrained academic resources. The OSUM model combines a Whisper encoder with a Qwen2 LLM and supports a wide range of speech tasks, including speech recognition (ASR), speech recognition with timestamps (SRWT), vocal event detection (VED), speech emotion recognition (SER),…

Code & Models

Repositories

aslp-lab/osum (PyTorch, official)

Models

🤗 ASLP-lab/OSUM (model, ♡ 12)

Taxonomy

Topics: Natural Language Processing Techniques