Tracking Listener Attention: Gaze-Guided Audio-Visual Speech Enhancement Framework

Hsiang-Cheng Yang; You-Jin Li; Rong Chao; Yu Tsao; Borching Su; Shao-Yi Chien

arXiv:2604.08359·eess.AS·April 10, 2026

TL;DR

This paper introduces a gaze-guided audio-visual speech enhancement (GG-AVSE) framework that uses the listener's gaze direction to select the target speaker in multi-talker environments. Over gaze-free baselines it reports relative gains of 10.08% in PESQ, 5.18% in STOI, and 23.69% in SI-SDR.

Contribution

The paper proposes the GG-AVSE framework, which combines gaze signals with the pretrained AVSEMamba model through a gaze-guided visual module (GG-VM), and introduces the AVSEC2-Gaze dataset for evaluation.

Findings

01. PESQ improved by 10.08% over the gaze-free baseline (2.370 to 2.609)

02. STOI improved by 5.18% over the gaze-free baseline (0.8802 to 0.9258)

03. SI-SDR improved by 23.69% over the gaze-free baseline
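
The percentages above appear to be relative improvements, i.e. (enhanced - baseline) / baseline. A minimal sketch that checks this reading against the PESQ and STOI scores quoted in the abstract; the function name `relative_improvement` is illustrative and not from the paper, and SI-SDR is left out because its absolute values are not given in this excerpt.

```python
def relative_improvement(baseline: float, enhanced: float) -> float:
    """Return the gain of `enhanced` over `baseline` as a percentage."""
    return (enhanced - baseline) / baseline * 100.0


# Scores quoted in the abstract (gaze-free baseline, then GG-AVSE).
pesq_gain = relative_improvement(2.370, 2.609)
stoi_gain = relative_improvement(0.8802, 0.9258)

print(f"PESQ: {pesq_gain:.2f}%")
print(f"STOI: {stoi_gain:.2f}%")
```

Both values round to the reported 10.08% and 5.18%, which supports the relative-improvement interpretation.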

Abstract

This paper presents a Gaze-Guided Audio-Visual Speech Enhancement (GG-AVSE) framework to address the cocktail party problem. A major challenge in conventional AVSE is identifying the listener's intended speaker in multi-talker environments. GG-AVSE addresses this issue by exploiting gaze direction as a supervisory cue for target-speaker selection. Specifically, we propose the GG-VM module, which combines gaze signals with a YOLO5Face detector to extract the target speaker's facial features and integrates them with the pretrained AVSEMamba model through two strategies: zero-shot merging and partial visual fine-tuning. For evaluation, we introduce the AVSEC2-Gaze dataset. Experimental results show that GG-AVSE achieves substantial performance gains over gaze-free baselines: a 10.08% improvement in PESQ (2.370 to 2.609), a 5.18% improvement in STOI (0.8802 to 0.9258), and a 23.69%…
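
The abstract says GG-VM combines gaze signals with a face detector to pick out the target speaker's face, but this excerpt does not state the selection rule. The sketch below shows one plausible rule under loud assumptions: faces arrive as pixel-space bounding boxes (as a detector such as YOLO5Face would output), the gaze is a single 2D point in the same frame, and the target is the face whose box is closest to that point. `FaceBox`, `distance_to_box`, and `select_target_face` are hypothetical names, not the paper's API.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class FaceBox:
    """Axis-aligned face bounding box in pixel coordinates (assumed format)."""
    x1: float
    y1: float
    x2: float
    y2: float


def distance_to_box(box: FaceBox, gx: float, gy: float) -> float:
    """Euclidean distance from the gaze point to the box (0 if inside it)."""
    dx = max(box.x1 - gx, 0.0, gx - box.x2)
    dy = max(box.y1 - gy, 0.0, gy - box.y2)
    return (dx * dx + dy * dy) ** 0.5


def select_target_face(
    boxes: Sequence[FaceBox], gaze_xy: Tuple[float, float]
) -> Optional[int]:
    """Return the index of the face nearest the gaze point, or None if no faces."""
    if not boxes:
        return None
    gx, gy = gaze_xy
    return min(range(len(boxes)), key=lambda i: distance_to_box(boxes[i], gx, gy))


# Two detected faces; the listener looks at the one on the right.
faces = [FaceBox(100, 80, 220, 240), FaceBox(400, 90, 520, 250)]
target = select_target_face(faces, (430.0, 150.0))
print(target)
```

In a pipeline of this shape, the crop for the selected index would be passed to the visual front end of the enhancement model; how the paper actually fuses gaze and detections may differ.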
