Thinking in Directivity: Speech Large Language Model for Multi-Talker Directional Speech Recognition

Jiamin Xie; Ju Lin; Yiteng Huang; Tyler Vuong; Zhaojiang Lin; Zhaojun Yang; Peng Su; Prashant Rawat; Sangeeta Srivastava; Ming Sun; Florian Metze

arXiv:2506.14973·eess.AS·June 19, 2025

TL;DR

This paper introduces directional-SpeechLlama, a Speech LLM that uses the microphone array of smart glasses to perform directional speech recognition, source localization, and bystander cross-talk suppression from spatial audio cues.

Contribution

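It presents a directional speech recognition approach built on two techniques, serialized directional output training (S-DOT) and contrastive direction data augmentation (CDDA), both aimed at improving how a Speech LLM understands directivity in multi-channel audio.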

Findings

01. Effective multi-talker speech recognition and source localization.
02. Strong performance in spatial audio understanding tasks.
03. Suppression of bystander cross-talk.

Abstract

Recent studies have demonstrated that prompting large language models (LLMs) with audio encodings enables effective speech recognition capabilities. However, the ability of Speech LLMs to comprehend and process multi-channel audio with spatial cues remains a relatively uninvestigated area of research. In this work, we present directional-SpeechLlama, a novel approach that leverages the microphone array of smart glasses to achieve directional speech recognition, source localization, and bystander cross-talk suppression. To enhance the model's ability to understand directivity, we propose two key techniques: serialized directional output training (S-DOT) and contrastive direction data augmentation (CDDA). Experimental results show that our proposed directional-SpeechLlama effectively captures the relationship between textual cues and spatial audio, yielding strong performance in both…

Taxonomy

Topics: Speech Recognition and Synthesis