Tracking Listener Attention: Gaze-Guided Audio-Visual Speech Enhancement Framework
Hsiang-Cheng Yang, You-Jin Li, Rong Chao, Yu Tsao, Borching Su, Shao-Yi Chien

TL;DR
This paper introduces a gaze-guided audio-visual speech enhancement framework that improves target speaker identification in multi-talker environments by leveraging gaze direction cues, demonstrating significant performance gains.
Contribution
The paper proposes a novel GG-AVSE framework combining gaze signals with AVSE models, and introduces the AVSEC2-Gaze dataset for evaluation.
Findings
Achieved 10.08% improvement in PESQ
Achieved 5.18% improvement in STOI
Achieved 23.69% improvement in SI-SDR
Abstract
This paper presents a Gaze-Guided Audio-Visual Speech Enhancement (GG-AVSE) framework to address the cocktail party problem. A major challenge in conventional AVSE is identifying the listener's intended speaker in multi-talker environments. GG-AVSE addresses this issue by exploiting gaze direction as a supervisory cue for target-speaker selection. Specifically, we propose the GG-VM module, which combines gaze signals with a YOLO5Face detector to extract the target speaker's facial features and integrates them with the pretrained AVSEMamba model through two strategies: zero-shot merging and partial visual fine-tuning. For evaluation, we introduce the AVSEC2-Gaze dataset. Experimental results show that GG-AVSE achieves substantial performance gains over gaze-free baselines: a 10.08% improvement in PESQ (2.370 to 2.609), a 5.18% improvement in STOI (0.8802 to 0.9258), and a 23.69%…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
