ELEGANCE: Efficient LLM Guidance for Audio-Visual Target Speech Extraction

Wenxuan Wu; Shuai Wang; Xixin Wu; Helen Meng; Haizhou Li

arXiv:2511.06288 · cs.SD · November 11, 2025

TL;DR

ELEGANCE is a framework that improves audio-visual target speech extraction by incorporating linguistic knowledge from large language models, with significant gains in challenging scenarios such as impaired visual cues, unseen languages, and target speaker switches.

Contribution

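ELEGANCE is presented as the first framework to incorporate LLM-derived linguistic guidance into AV-TSE models. It combines three guidance strategies: output linguistic constraints, intermediate linguistic prediction, and input linguistic prior.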

Findings

01 Improves extraction accuracy when visual cues are impaired

02 Improves performance on unseen languages and target speaker switches

03 Effective across two AV-TSE backbones

Abstract

Audio-visual target speaker extraction (AV-TSE) models primarily rely on visual cues from the target speaker. However, humans also leverage linguistic knowledge, such as syntactic constraints, next-word prediction, and prior knowledge of the conversation, to extract target speech. Inspired by this observation, we propose ELEGANCE, a novel framework that incorporates linguistic knowledge from large language models (LLMs) into AV-TSE models through three distinct guidance strategies: output linguistic constraints, intermediate linguistic prediction, and input linguistic prior. Comprehensive experiments with RoBERTa, Qwen3-0.6B, and Qwen3-4B on two AV-TSE backbones demonstrate the effectiveness of our approach. Significant improvements are observed in challenging scenarios, including impaired visual cues, unseen languages, target speaker switches, increased interfering speakers, and…

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing