Equipping LLM with Directional Multi-Talker Speech Understanding Capabilities
Ju Lin, Jing Pan, Ruizhi Li, Ming Sun, Yuzong Liu, Alaa Hassan, Jing Zheng, Florian Metze

TL;DR
This paper introduces methods to enhance large language models with directional multi-talker speech understanding using multi-microphone arrays in smart glasses, enabling improved recognition and translation in multi-talker scenarios.
Contribution
It proposes two novel approaches, a cascaded system and an end-to-end system, for integrating directivity into LLMs for multi-talker speech understanding in smart glasses.
Findings
Effective multi-talker speech recognition achieved
Enhanced speech translation performance demonstrated
Methods work in streaming, real-time settings
Abstract
Recent studies have demonstrated that prompting large language models (LLMs) with audio encodings enables effective speech understanding capabilities. However, most speech LLMs are trained on single-channel, single-talker data, which makes it challenging to directly apply them to multi-talker and multi-channel speech understanding tasks. In this work, we present a comprehensive investigation of how to enable directional multi-talker speech understanding capabilities for LLMs, specifically in the smart glasses use case. We propose two novel approaches to integrate directivity into LLMs: (1) a cascaded system that leverages a source separation front-end module, and (2) an end-to-end system that utilizes serialized output training. All of the approaches utilize a multi-microphone array embedded in smart glasses to optimize directivity interpretation and processing in a streaming manner.…
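The abstract names serialized output training for the end-to-end system but does not describe the target format. As a rough illustration only, the sketch below builds a serialized training target in the usual way: transcripts from overlapping talkers are ordered by start time and joined with a speaker-change token. The `<sc>` token, the `Utterance` type, the `serialize_targets` function, and the bracketed direction tags are all assumptions made for this example and are not taken from the paper.

```python
from dataclasses import dataclass

# Hypothetical separator token marking a change of talker in the target text.
SPEAKER_CHANGE = "<sc>"


@dataclass
class Utterance:
    start: float      # start time in seconds
    text: str         # reference transcript
    direction: str    # hypothetical direction label, e.g. "wearer" or "partner"


def serialize_targets(utterances, tag_direction=True):
    """Join transcripts in order of start time, separated by SPEAKER_CHANGE.

    If tag_direction is True, each transcript is prefixed with its
    (assumed) direction label in square brackets.
    """
    ordered = sorted(utterances, key=lambda u: u.start)
    parts = []
    for u in ordered:
        parts.append(f"[{u.direction}] {u.text}" if tag_direction else u.text)
    return f" {SPEAKER_CHANGE} ".join(parts)


if __name__ == "__main__":
    utts = [
        Utterance(1.2, "fine thanks", "partner"),
        Utterance(0.0, "how are you", "wearer"),
    ]
    print(serialize_targets(utts))
```

With a single serialized string as the target, one decoder can emit all talkers in one pass instead of needing a separate output branch per talker. How the paper actually encodes direction in the output is not stated in the abstract.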
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Face recognition and analysis
