Target Speech Diarization with Multimodal Prompts

Yidi Jiang; Ruijie Tao; Zhengyang Chen; Yanmin Qian; Haizhou Li

arXiv:2406.07198 · eess.AS · June 12, 2024 · 1 citation

TL;DR

This paper introduces MM-TSD, a flexible multimodal framework for target speech diarization that uses diverse prompts and a voice-face aligner, achieving robust performance in complex real-world scenarios.

Contribution

The paper presents a novel multimodal prompt-based framework for target speech diarization, including a voice-face aligner and a new dataset, enabling versatile and effective speech event detection.

Findings

01. Achieves robust performance comparable to specialized models

02. Handles complex real-world conversations effectively

03. Demonstrates versatility across multiple signal processing tasks

Abstract

Traditional speaker diarization seeks to detect "who spoke when" according to speaker characteristics. Extending to target speech diarization, we detect "when target event occurs" according to the semantic characteristics of speech. We propose a novel Multimodal Target Speech Diarization (MM-TSD) framework, which accommodates diverse and multi-modal prompts to specify target events in a flexible and user-friendly manner, including semantic language description, pre-enrolled speech, pre-registered face image, and audio-language logical prompts. We further propose a voice-face aligner module to project human voice and face representation into a shared space. We develop a multi-modal dataset based on VoxCeleb2 for MM-TSD training and evaluation. Additionally, we conduct comparative analysis and ablation studies for each category of prompts to validate the efficacy of each component in…
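
As a rough illustration of what projecting voice and face representations into a shared space could look like, the following minimal sketch maps two embeddings of different sizes into a common space and compares them by cosine similarity. Everything here is an assumption for illustration: the dimensions, the linear projections, the random untrained weights, and the names `W_voice`, `W_face`, and `project` are hypothetical and are not taken from the paper, whose aligner design is not described on this page.

```python
import math
import random


def project(vec, weights):
    """Apply a linear map, given as a list of rows, to a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def random_matrix(rows, cols, rng):
    """Gaussian random matrix, standing in for learned projection weights."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)

# Hypothetical sizes: the two modalities start in different spaces.
VOICE_DIM, FACE_DIM, SHARED_DIM = 6, 4, 3

# Untrained placeholder projections; a real aligner would learn these
# so that the voice and face of the same person land close together.
W_voice = random_matrix(SHARED_DIM, VOICE_DIM, rng)
W_face = random_matrix(SHARED_DIM, FACE_DIM, rng)

# Stand-ins for the outputs of a speaker encoder and a face encoder.
voice_emb = [rng.gauss(0.0, 1.0) for _ in range(VOICE_DIM)]
face_emb = [rng.gauss(0.0, 1.0) for _ in range(FACE_DIM)]

# After projection, both embeddings have the same dimensionality.
voice_shared = l2_normalize(project(voice_emb, W_voice))
face_shared = l2_normalize(project(face_emb, W_face))

similarity = cosine(voice_shared, face_shared)
print(f"shared-space similarity: {similarity:.3f}")
```

With untrained weights the similarity value carries no meaning; the point is only that, once both modalities live in one space, a face prompt and a voice prompt can be compared or used interchangeably.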

Taxonomy

Topics: Speech Recognition and Synthesis