Roadmap towards Superhuman Speech Understanding using Large Language Models
Fan Bu, Yuhao Zhang, Xidong Wang, Benyou Wang, Qun Liu, Haizhou Li

TL;DR
This paper presents a comprehensive five-level roadmap for developing superhuman speech understanding using large language models, introduces a benchmark for evaluation, and discusses current limitations and future directions.
Contribution
It proposes a structured roadmap for speech LLM development, introduces the SAGI Benchmark for standardized evaluation, and analyzes current challenges in handling non-semantic and acoustic information.
Findings
Gaps in handling paralinguistic cues identified
Challenges in integrating abstract acoustic knowledge highlighted
Benchmark reveals limitations in current speech LLM capabilities
Abstract
The success of large language models (LLMs) has prompted efforts to integrate speech and audio data, aiming to create general foundation models capable of processing both textual and non-textual inputs. Recent advances, such as GPT-4o, highlight the potential of end-to-end speech LLMs, which preserve non-semantic information and world knowledge for deeper speech understanding. To guide the development of speech LLMs, we propose a five-level roadmap, ranging from basic automatic speech recognition (ASR) to advanced superhuman models capable of integrating non-semantic information with abstract acoustic knowledge for complex tasks. Moreover, we design a benchmark, the SAGI Benchmark, that standardizes critical aspects across various tasks in these five levels, uncovering challenges in using abstract acoustic knowledge and completeness of capability. Our findings reveal gaps in handling…
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques
