Roadmap towards Superhuman Speech Understanding using Large Language Models

Fan Bu; Yuhao Zhang; Xidong Wang; Benyou Wang; Qun Liu; Haizhou Li

arXiv:2410.13268·cs.CL·October 18, 2024

TL;DR

This paper presents a comprehensive five-level roadmap for developing superhuman speech understanding using large language models, introduces a benchmark for evaluation, and discusses current limitations and future directions.

Contribution

It proposes a structured roadmap for speech LLM development, introduces the SAGI Benchmark for standardized evaluation, and analyzes current challenges in handling non-semantic and acoustic information.

Findings

1. Gaps in handling paralinguistic cues identified
2. Challenges in integrating abstract acoustic knowledge highlighted
3. Benchmark reveals limitations in current speech LLM capabilities

Abstract

The success of large language models (LLMs) has prompted efforts to integrate speech and audio data, aiming to create general foundation models capable of processing both textual and non-textual inputs. Recent advances, such as GPT-4o, highlight the potential for end-to-end speech LLMs, which preserve non-semantic information and world knowledge for deeper speech understanding. To guide the development of speech LLMs, we propose a five-level roadmap, ranging from basic automatic speech recognition (ASR) to advanced superhuman models capable of integrating non-semantic information with abstract acoustic knowledge for complex tasks. Moreover, we design a benchmark, the SAGI Benchmark, that standardizes critical aspects across various tasks in these five levels, uncovering challenges in using abstract acoustic knowledge and completeness of capability. Our findings reveal gaps in handling…

Code & Models

Datasets

FreedomIntelligence/DitingBench
dataset · 157 downloads

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques