CineSRD: Leveraging Visual, Acoustic, and Linguistic Cues for Open-World Visual Media Speaker Diarization

Liangbin Huang; Xiaohua Liao; Chaoqun Cui; Shijing Wang; Zhaolong Huang; Yanlong Du; Wenji Mao

arXiv:2603.16966·cs.CV·March 19, 2026

Open Access

TL;DR

This paper introduces CineSRD, a multimodal framework for open-world speaker diarization in complex visual media, integrating visual, acoustic, and linguistic cues to improve speaker annotation in unconstrained audiovisual content.

Contribution

We propose CineSRD, a novel unified multimodal approach that addresses the challenges of open-world visual media diarization, and provide a new benchmark for this task.

Findings

01. CineSRD outperforms existing methods on the new benchmark.

02. The framework effectively integrates visual, acoustic, and linguistic cues.

03. CineSRD achieves competitive results on traditional datasets.

Abstract

Traditional speaker diarization systems have primarily focused on constrained scenarios such as meetings and interviews, where the number of speakers is limited and acoustic conditions are relatively clean. To explore open-world speaker diarization, we extend this task to the visual media domain, encompassing complex audiovisual programs such as films and TV series. This new setting introduces several challenges, including long-form video understanding, a large number of speakers, cross-modal asynchrony between audio and visual cues, and uncontrolled in-the-wild variability. To address these challenges, we propose Cinematic Speaker Registration & Diarization (CineSRD), a unified multimodal framework that leverages visual, acoustic, and linguistic cues from video, speech, and subtitles for speaker annotation. CineSRD first performs visual anchor clustering to register initial speakers…
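
The abstract says only that CineSRD begins with visual anchor clustering to register initial speakers; it gives no algorithmic detail in the text available here. As a rough illustration of what registering speakers by clustering could look like, the sketch below greedily groups face embeddings by cosine similarity. Everything in it is an assumption made for illustration: the function names `cosine` and `register_speakers`, the `threshold` value, the running-mean centroids, and the toy 2-D embeddings. It is not the method used in the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def register_speakers(embeddings, threshold=0.8):
    """Greedy clustering of face embeddings (hypothetical stand-in for anchor clustering).

    Each embedding joins the most similar existing centroid when the similarity
    is at least `threshold`; otherwise it opens a new speaker entry.
    Returns (labels, centroids).
    """
    centroids = []  # running mean embedding per registered speaker
    counts = []     # number of embeddings merged into each centroid
    labels = []
    for emb in embeddings:
        best, best_sim = -1, threshold
        for i, c in enumerate(centroids):
            sim = cosine(emb, c)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best < 0:
            # No centroid is similar enough: register a new speaker.
            centroids.append(list(emb))
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            # Update the running mean of the matched speaker.
            n = counts[best]
            centroids[best] = [(c * n + x) / (n + 1) for c, x in zip(centroids[best], emb)]
            counts[best] = n + 1
            labels.append(best)
    return labels, centroids


# Toy 2-D "face embeddings" for two distinct faces (illustrative values only).
faces = [[1.0, 0.0], [0.98, 0.05], [0.0, 1.0], [0.02, 0.99], [1.0, 0.02]]
labels, centroids = register_speakers(faces)
print(labels, len(centroids))
```

In an open-world setting the number of speakers is not known in advance, which is why this sketch uses a similarity threshold instead of a fixed cluster count; a real system would draw embeddings from a face-recognition model and would likely use a more robust clustering scheme.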

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Face recognition and analysis