CALM: Contrastive Cross-modal Speaking Style Modeling for Expressive Text-to-Speech Synthesis
Yi Meng, Xiang Li, Zhiyong Wu, Tingtian Li, Zixun Sun, Xinyu Xiao, Chi Sun, Hui Zhan, Helen Meng

TL;DR
This paper introduces CALM, a contrastive learning-based module that improves reference speech selection for expressive TTS by extracting style-related text features, leading to more accurate and expressive synthesized speech.
Contribution
The paper proposes CALM, a novel contrastive acoustic-linguistic module that enhances reference speech selection by focusing on style-related text features, improving expressive TTS synthesis.
Findings
CALM outperforms baseline methods in objective evaluations.
Subjective listening tests favor CALM-enhanced synthesis.
The approach effectively isolates style-related information from text.
Abstract
To further improve the speaking styles of synthesized speeches, current text-to-speech (TTS) synthesis systems commonly employ reference speeches to stylize their outputs instead of just the input texts. These reference speeches are obtained by manual selection, which is resource-consuming, or selected by semantic features. However, semantic features contain not only style-related information but also style-irrelevant information. The information irrelevant to speaking style in the text could interfere with the reference audio selection and result in improper speaking styles. To improve the reference selection, we propose the Contrastive Acoustic-Linguistic Module (CALM) to extract the Style-related Text Feature (STF) from the text. CALM optimizes the correlation between the speaking style embedding and the extracted STF with contrastive learning. Thus, a certain number of the most appropriate…
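The abstract describes two ideas: pulling each text's STF toward the style embedding of its own speech with contrastive learning, and then using the STF to pick reference speeches. The sketch below illustrates both under stated assumptions. The abstract does not give the loss or the selection rule, so an InfoNCE-style loss over cosine similarities and a top-k cosine ranking are swapped in as common choices. The function names, the temperature value, and the toy vectors are all hypothetical and are not taken from the paper.

```python
import math

def l2_normalize(v):
    # Assumes a non-zero vector; a zero vector would divide by zero.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_matrix(text_feats, style_embs):
    # Entry [i][j] is the cosine similarity between text feature i
    # and style embedding j.
    t = [l2_normalize(v) for v in text_feats]
    s = [l2_normalize(v) for v in style_embs]
    return [[sum(a * b for a, b in zip(ti, sj)) for sj in s] for ti in t]

def info_nce(text_feats, style_embs, temperature=0.1):
    # Assumed objective (InfoNCE): text i and style embedding i form the
    # positive pair; every other style embedding in the batch is a negative.
    sims = cosine_matrix(text_feats, style_embs)
    total = 0.0
    for i, row in enumerate(sims):
        logits = [x / temperature for x in row]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - logits[i]
    return total / len(sims)

def top_k_references(query_feat, candidate_feats, k=2):
    # Assumed selection rule: rank candidate utterances by cosine
    # similarity between their text features and the query's.
    sims = cosine_matrix([query_feat], candidate_feats)[0]
    ranked = sorted(range(len(sims)), key=lambda j: sims[j], reverse=True)
    return ranked[:k]

# Toy data: matched pairs should give a lower loss than shuffled pairs.
text = [[1.0, 0.0], [0.0, 1.0]]
style_matched = [[1.0, 0.0], [0.0, 1.0]]
style_shuffled = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = info_nce(text, style_matched)
loss_shuffled = info_nce(text, style_shuffled)

# Indices of the two candidates closest to the query feature.
selected = top_k_references([1.0, 0.0], [[0.0, 1.0], [1.0, 0.1], [1.0, 1.0]], k=2)
```

In a real system the text features and style embeddings would come from trained encoders and the loss would be minimized by gradient descent; the plain lists here only show the shape of the computation.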
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
