Target Speech Diarization with Multimodal Prompts
Yidi Jiang, Ruijie Tao, Zhengyang Chen, Yanmin Qian, Haizhou Li

TL;DR
This paper introduces MM-TSD, a flexible multimodal framework for target speech diarization that uses diverse prompts and a voice-face aligner, achieving robust performance in complex real-world scenarios.
Contribution
The paper presents a novel multimodal prompt-based framework for target speech diarization, including a voice-face aligner and a new dataset, enabling versatile and effective speech event detection.
Findings
Achieves robust performance comparable to specialized models
Handles complex real-world conversations effectively
Demonstrates versatility across multiple signal processing tasks
Abstract
Traditional speaker diarization seeks to detect "who spoke when" according to speaker characteristics. Extending to target speech diarization, we detect "when target event occurs" according to the semantic characteristics of speech. We propose a novel Multimodal Target Speech Diarization (MM-TSD) framework, which accommodates diverse and multi-modal prompts to specify target events in a flexible and user-friendly manner, including semantic language description, pre-enrolled speech, pre-registered face image, and audio-language logical prompts. We further propose a voice-face aligner module to project human voice and face representation into a shared space. We develop a multi-modal dataset based on VoxCeleb2 for MM-TSD training and evaluation. Additionally, we conduct comparative analysis and ablation studies for each category of prompts to validate the efficacy of each component in…
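To make the "when target event occurs" output concrete, here is a minimal sketch of how per-frame target scores could be turned into time segments. It is not the authors' implementation. The function name `scores_to_segments`, the 0.5 threshold, and the 20 ms frame shift are all assumptions chosen for illustration, as is the premise that the model emits one score per frame for the prompted event.

```python
def scores_to_segments(scores, threshold=0.5, frame_shift=0.02):
    """Convert per-frame target scores into (start, end) segments in seconds.

    Hypothetical helper: threshold and frame_shift are illustrative
    assumptions, not values taken from the paper.
    """
    segments = []
    start = None  # index of the first frame of the currently open segment
    for i, score in enumerate(scores):
        active = score >= threshold
        if active and start is None:
            start = i  # target event begins at this frame
        elif not active and start is not None:
            # target event ended at the previous frame; close the segment
            segments.append((round(start * frame_shift, 3),
                             round(i * frame_shift, 3)))
            start = None
    if start is not None:
        # the event was still active at the last frame
        segments.append((round(start * frame_shift, 3),
                         round(len(scores) * frame_shift, 3)))
    return segments


if __name__ == "__main__":
    example_scores = [0.1, 0.8, 0.9, 0.2, 0.7, 0.6]
    print(scores_to_segments(example_scores))
```

Under these assumptions, every prompt type (language description, enrolled speech, face image, or logical prompt) would yield the same kind of output: a list of time intervals in which the specified event is judged to be active.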
Taxonomy
Topics: Speech Recognition and Synthesis
