Equipping LLM with Directional Multi-Talker Speech Understanding Capabilities

Ju Lin; Jing Pan; Ruizhi Li; Ming Sun; Yuzong Liu; Alaa Hassan; Jing Zheng; Florian Metze

arXiv:2602.07211 · cs.CL · February 10, 2026

Open Access

TL;DR

This paper introduces methods to enhance large language models with directional multi-talker speech understanding using multi-microphone arrays in smart glasses, enabling improved recognition and translation in multi-talker scenarios.

Contribution

The paper proposes two novel approaches, a cascaded system and an end-to-end system, for integrating directivity into LLMs for multi-talker speech understanding in smart glasses.

Findings

01. Effective multi-talker speech recognition achieved

02. Enhanced speech translation performance demonstrated

03. Methods work in streaming, real-time settings

Abstract

Recent studies have demonstrated that prompting large language models (LLMs) with audio encodings enables effective speech understanding capabilities. However, most speech LLMs are trained on single-channel, single-talker data, which makes it challenging to apply them directly to multi-talker and multi-channel speech understanding tasks. In this work, we present a comprehensive investigation of how to enable directional multi-talker speech understanding capabilities for LLMs, specifically in the smart glasses use case. We propose two novel approaches to integrate directivity into LLMs: (1) a cascaded system that leverages a source separation front-end module, and (2) an end-to-end system that utilizes serialized output training. Both approaches use a multi-microphone array embedded in smart glasses to optimize directivity interpretation and processing in a streaming manner.…
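The abstract does not spell out how serialized output training builds its targets. As a rough illustration of the general technique only, not the authors' implementation, the sketch below orders overlapping utterances by start time and joins their transcripts into one target string with a speaker-change token. The `<sc>` token, the `Utterance` fields, and the `wearer`/`partner` labels are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical separator token; the paper's actual token inventory is not given here.
SPEAKER_CHANGE = "<sc>"


@dataclass
class Utterance:
    speaker: str   # illustrative label, e.g. "wearer" or "partner"
    start: float   # utterance start time in seconds
    text: str      # reference transcript


def serialize_target(utterances: List[Utterance], sep: str = SPEAKER_CHANGE) -> str:
    """Build one training target by ordering utterances by start time
    and joining their transcripts with a speaker-change token."""
    ordered = sorted(utterances, key=lambda u: u.start)
    return f" {sep} ".join(u.text.strip() for u in ordered)


if __name__ == "__main__":
    mixture = [
        Utterance("partner", 1.2, "where is the station"),
        Utterance("wearer", 0.3, "excuse me"),
    ]
    print(serialize_target(mixture))  # excuse me <sc> where is the station
```

With a single serialized target, one decoder can emit all talkers' transcripts in one output sequence, which is what lets an end-to-end model avoid a separate separation stage.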

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Face recognition and analysis