$\text{M}^3\text{PDB}$: A Multimodal, Multi-Label, Multilingual Prompt Database for Speech Generation
Boyu Zhu, Cheng Gong, Muyang Wu, Ruihao Jing, Fan Liu, Xiaolei Zhang, Chi Zhang, Xuelong Li

TL;DR
This paper introduces $\text{M}^3\text{PDB}$, a large-scale, multilingual, multi-modal prompt database designed to improve zero-shot speech generation in real-world scenarios with diverse and imperfect prompts.
Contribution
We present the first multi-modal, multi-label, multilingual prompt database for speech generation and a prompt selection strategy optimized for real-time, resource-constrained environments.
Findings
The database supports diverse speech generation scenarios.
The prompt selection strategy improves robustness in challenging conditions.
Experimental results validate the effectiveness of the dataset and method.
Abstract
Recent advancements in zero-shot speech generation have enabled models to synthesize speech that mimics speaker identity and speaking style from speech prompts. However, these models' effectiveness is significantly limited in real-world scenarios where high-quality speech prompts are absent, incomplete, or out of domain. This issue arises primarily from a significant quality mismatch between the speech data utilized for model training and the input prompt speech during inference. To address this, we introduce $\text{M}^3\text{PDB}$, the first large-scale, multi-modal, multi-label, and multilingual prompt database designed for robust prompt selection in speech generation. Our dataset construction leverages a novel multi-modal, multi-agent annotation framework, enabling precise and hierarchical labeling across diverse modalities. Furthermore, we propose a lightweight yet effective prompt…
