Combining Masked Language Modeling and Cross-Modal Contrastive Learning for Prosody-Aware TTS

Kirill Borodin; Vasiliy Kudryavtsev; Maxim Maslov; Nikita Vasiliev; Mikhail Gorodnichev; Grach Mkrtchian

arXiv:2604.01247·cs.SD·April 3, 2026

TL;DR

This paper explores multi-stage pretraining for prosody modeling in diffusion-based TTS, combining masked language modeling and cross-modal contrastive learning to improve synthesis quality.

Contribution

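The paper introduces a speaker-conditioned dual-stream encoder pretrained with MLM and then with SigLIP-style cross-modal contrastive learning. It shows a trade-off between phoneme discrimination and prosodic sensitivity: same-phoneme refinement improves prosodic retrieval but reduces phoneme discrimination.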

Findings

01 Two-stage curriculum improves synthesis quality in TTS

02 Same-phoneme refinement enhances prosodic retrieval but degrades synthesis

03 Embedding metrics do not always correlate with generative performance

Abstract

We investigate multi-stage pretraining for prosody modeling in diffusion-based TTS. A speaker-conditioned dual-stream encoder is trained with masked language modeling followed by SigLIP-style cross-modal contrastive learning using mixed-phoneme batches, with an additional same-phoneme refinement stage studied separately. We evaluate intrinsic text-audio retrieval and downstream synthesis in Grad-TTS and a latent diffusion TTS system. The two-stage curriculum (MLM + mixed-phoneme contrastive learning) achieves the best overall synthesis quality in terms of intelligibility, speaker similarity, and perceptual measures. Although same-phoneme refinement improves prosodic retrieval, it reduces phoneme discrimination and degrades synthesis. These findings indicate that improvements in embedding-space metrics do not necessarily translate to better generative performance and highlight the need…
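For readers unfamiliar with the SigLIP-style objective mentioned in the abstract, the following is a minimal NumPy sketch of a pairwise sigmoid contrastive loss between text and audio embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function names, the embedding shapes, and the fixed temperature and bias values (taken from the initial values used in the original SigLIP work, where both are learned) are assumptions, and the paper's speaker conditioning and mixed-phoneme batch construction are not modeled.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def sigmoid_contrastive_loss(text_emb, audio_emb, temperature=10.0, bias=-10.0):
    # Hypothetical helper: row i of text_emb is assumed to pair with row i of audio_emb.
    # temperature and bias are fixed here for illustration; SigLIP learns them.
    t = l2_normalize(text_emb)
    a = l2_normalize(audio_emb)
    logits = temperature * (t @ a.T) + bias
    n = logits.shape[0]
    # +1 on the diagonal (matched pairs), -1 everywhere else (mismatched pairs).
    labels = 2.0 * np.eye(n) - 1.0
    # log(sigmoid(x)) computed stably as -log(1 + exp(-x)).
    log_sigmoid = -np.logaddexp(0.0, -labels * logits)
    # Sum over all pairs, averaged over the batch size.
    return -log_sigmoid.sum() / n

# Usage with random stand-in embeddings (batch of 4, dimension 8).
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))
loss_matched = sigmoid_contrastive_loss(text, text)
loss_mismatched = sigmoid_contrastive_loss(text, np.roll(text, 1, axis=0))
print(loss_matched, loss_mismatched)
```

Unlike a softmax-based contrastive loss, each text-audio pair is scored as an independent binary decision, so the loss does not require normalizing over the whole batch. In this toy usage, the matched pairing should yield a lower loss than the deliberately mismatched one.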
