LASER: Lip Landmark Assisted Speaker Detection for Robustness

Le Thien Phuc Nguyen; Zhuoran Yu; Yong Jae Lee

arXiv:2501.11899 · cs.CV · November 27, 2025

TL;DR

LASER enhances active speaker detection by explicitly using lip landmarks during training, improving robustness against low resolution, occlusion, and background noise without requiring landmarks at test time.

Contribution

Introduces LASER, a novel method that incorporates lip landmarks into training for more robust speaker detection, and creates LASER-bench to evaluate performance under noisy conditions.

Findings

01. LASER outperforms state-of-the-art models on both in-domain and out-of-domain benchmarks.

02. LASER improves detection accuracy in high-noise environments.

03. The auxiliary consistency loss improves robustness without adding test-time complexity, since no landmark detector is needed at inference.
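
The auxiliary loss can be illustrated with a small sketch. It compares the prediction of a lip-aware branch with that of a face-only branch and penalizes disagreement, so the face-only branch can be used alone at test time. The exact loss used by LASER is not stated in this summary; the KL divergence below, the two-class logit layout (speaking, not speaking), and the function names are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def consistency_loss(lip_aware_logits, face_only_logits):
    """Assumed form: KL(lip_aware || face_only) over class probabilities.

    The loss is zero when both branches predict the same distribution
    and grows as the face-only prediction drifts from the lip-aware one.
    """
    p = softmax(lip_aware_logits)
    q = softmax(face_only_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical logits for one face crop: [speaking, not speaking].
agree = consistency_loss([2.0, -1.0], [2.0, -1.0])
disagree = consistency_loss([2.0, -1.0], [-1.0, 2.0])
print(agree, disagree)
```

In a training loop, a term like this would be added to the main classification loss with some weight. At test time only the face-only branch runs, which is why the extra term adds no inference cost.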

Abstract

Active Speaker Detection (ASD) aims to identify who is speaking in complex visual scenes. While humans naturally rely on lip-audio synchronization, existing ASD models often misclassify non-speaking instances when lip movements and audio are unsynchronized. To address this, we propose Lip landmark Assisted Speaker dEtection for Robustness (LASER), which explicitly incorporates lip landmarks during training to guide the model's attention to speech-relevant regions. Given a face track, LASER extracts visual features and encodes 2D lip landmarks into dense maps. To handle failure cases such as low resolution or occlusion, we introduce an auxiliary consistency loss that aligns lip-aware and face-only predictions, removing the need for landmark detectors at test time. LASER outperforms state-of-the-art models across both in-domain and out-of-domain benchmarks. To further evaluate robustness…
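
As a rough illustration of what encoding 2D lip landmarks into a dense map could look like, the sketch below renders each landmark as a Gaussian blob on a grid the size of a feature map. The abstract does not say which encoding LASER uses, so the Gaussian heatmap, the normalized coordinate convention, the `sigma` value, and the function name are assumptions rather than the authors' implementation.

```python
import math


def landmarks_to_dense_map(landmarks, height, width, sigma=1.5):
    """Render normalized (x, y) landmarks as one height x width heatmap.

    Each landmark contributes a Gaussian blob centered on its location.
    Where blobs overlap, the strongest response is kept.
    """
    heatmap = [[0.0] * width for _ in range(height)]
    for x_norm, y_norm in landmarks:
        # Map normalized [0, 1] coordinates onto the pixel grid.
        cx = x_norm * (width - 1)
        cy = y_norm * (height - 1)
        for row in range(height):
            for col in range(width):
                dist_sq = (col - cx) ** 2 + (row - cy) ** 2
                value = math.exp(-dist_sq / (2.0 * sigma ** 2))
                if value > heatmap[row][col]:
                    heatmap[row][col] = value
    return heatmap


# Two hypothetical lip corners on a 5 x 5 grid.
lip_points = [(0.25, 0.5), (0.75, 0.5)]
dense_map = landmarks_to_dense_map(lip_points, height=5, width=5)
for row in dense_map:
    print(" ".join(f"{value:.2f}" for value in row))
```

A map like this has the same spatial layout as the visual features, so it can be concatenated or fused with them. That property is the motivation for a dense encoding over a flat list of coordinates.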

Code & Models

Repositories

plnguyen2908/laser_asd (PyTorch, official)

Taxonomy

Topics: Speech and Audio Processing · Face recognition and analysis

Methods: ALIGN