Deep Multimodal Speaker Naming
Yongtao Hu, Jimmy Ren, Jingwen Dai, Chang Yuan, Li Xu, and Wenping Wang

TL;DR
This paper introduces a CNN-based framework for automatic speaker naming in videos that effectively fuses face and audio cues, achieving state-of-the-art results without relying on face tracking or transcripts.
Contribution
The paper presents a novel deep learning approach that automatically learns multimodal feature fusion for speaker naming, outperforming previous heuristic-based methods.
Findings
Achieves state-of-the-art speaker naming accuracy on TV series datasets.
Does not require face tracking, facial landmarks, or transcripts.
Demonstrates robustness across diverse video scenes.
Abstract
Automatic speaker naming is the problem of localizing as well as identifying each speaking character in a TV/movie/live show video. This is a challenging problem, mainly owing to its multimodal nature: the face cue alone is insufficient to achieve good performance. Previous multimodal approaches to this problem usually process the data of different modalities individually and merge them using handcrafted heuristics. Such approaches work well for simple scenes, but fail to achieve high performance for speakers with large appearance variations. In this paper, we propose a novel convolutional neural network (CNN) based learning framework to automatically learn the fusion function of both face and audio cues. We show that without using face tracking, facial landmark localization or subtitle/transcript, our system with robust multimodal feature extraction is able to achieve…
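To make the idea of a learned fusion function over face and audio cues concrete, the following is a minimal toy sketch in pure Python. It is not the authors' architecture: the concatenate-then-classify design, the single linear layer standing in for a trained network, and all names (`fuse_and_classify`, `name_speaker`, the weights and feature values) are assumptions chosen for illustration only.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(face_feat, audio_feat, weights, biases):
    """Concatenate one face feature with the audio feature and score each identity.

    `weights` has one row per identity; in a real system these parameters
    would be learned jointly with the feature extractors (assumption).
    """
    fused = list(face_feat) + list(audio_feat)
    scores = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(scores)


def name_speaker(face_feats, audio_feat, weights, biases, names):
    """Return (face index, name, probability) for the most confident face.

    Every detected face in the frame is paired with the same audio window,
    so no face tracking is needed in this toy setup.
    """
    best = None
    for idx, face in enumerate(face_feats):
        probs = fuse_and_classify(face, audio_feat, weights, biases)
        k = max(range(len(probs)), key=probs.__getitem__)
        if best is None or probs[k] > best[2]:
            best = (idx, names[k], probs[k])
    return best


# Hypothetical values: two faces in a frame, one audio window, two identities.
names = ["A", "B"]
weights = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]
biases = [0.0, 0.0]
face_feats = [[2.0, 0.0], [0.0, 1.0]]
audio_feat = [1.0]
result = name_speaker(face_feats, audio_feat, weights, biases, names)
```

The point of the sketch is only that the weights act on the joint face-plus-audio vector, so the way the two cues are combined is something a model can learn rather than a handcrafted merging rule.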
