Learning Speech Representation From Contrastive Token-Acoustic Pretraining
Chunyu Qiang, Hao Li, Yixin Tian, Ruibo Fu, Tao Wang, Longbiao Wang, Jianwu Dang

TL;DR
This paper introduces CTAP, a contrastive pretraining method that learns fine-grained speech representations by aligning phonemes and speech at the frame level, improving tasks like TTS, VC, and ASR.
Contribution
The paper presents a novel contrastive token-acoustic pretraining approach that effectively models frame-level speech-phoneme relationships, addressing redundancy and dimension issues in speech representations.
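To make the idea of frame-level contrastive alignment concrete, here is a minimal sketch of a symmetric InfoNCE-style loss over paired frames. It is an illustration under stated assumptions, not the paper's implementation: the function name `frame_contrastive_loss`, the temperature of 0.07, the use of cosine similarity, and the premise that speech and phoneme embeddings are already length-aligned frame by frame are all hypothetical choices made for this example.

```python
import math


def _normalize(vec):
    # Scale a vector to unit length (assumes the vector is non-zero).
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _log_softmax_at(scores, target):
    # Log-probability of index `target` under a softmax over `scores`,
    # computed with the usual max-subtraction for numerical stability.
    peak = max(scores)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in scores))
    return scores[target] - log_z


def frame_contrastive_loss(speech, phoneme, temperature=0.07):
    """Symmetric contrastive loss over T paired frames.

    speech, phoneme: lists of T embedding vectors with equal dimension.
    Frame i of `speech` is treated as the positive for frame i of
    `phoneme`; every other frame in the sequence acts as a negative.
    The temperature value is an assumed default, not taken from the paper.
    """
    s = [_normalize(v) for v in speech]
    p = [_normalize(v) for v in phoneme]
    n = len(s)

    # Cosine-similarity matrix scaled by temperature: logits[i][j]
    # compares speech frame i with phoneme frame j.
    logits = [
        [sum(a * b for a, b in zip(s[i], p[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Speech-to-phoneme direction: softmax over each row.
    s2p = -sum(_log_softmax_at(logits[i], i) for i in range(n)) / n

    # Phoneme-to-speech direction: softmax over each column.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    p2s = -sum(_log_softmax_at(columns[j], j) for j in range(n)) / n

    return 0.5 * (s2p + p2s)
```

When each speech frame matches its own phoneme frame, the loss is close to zero; if the phoneme frames are shifted so that the pairs no longer line up, the loss rises sharply. In a real system the embeddings would come from trained speech and phoneme encoders, which this sketch does not model.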
Findings
Achieved minimally-supervised TTS, VC, and ASR with 210k speech-phoneme pairs
Demonstrated improved fine-grained speech representation learning
Provided audio samples on a dedicated website
Abstract
For fine-grained generation and recognition tasks such as minimally-supervised text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), the intermediate representations extracted from speech should serve as a "bridge" between text and acoustic information, containing information from both modalities. The semantic content is emphasized, while the paralinguistic information such as speaker identity and acoustic details should be de-emphasized. However, existing methods for extracting fine-grained intermediate representations from speech suffer from issues of excessive redundancy and dimension explosion. Contrastive learning is a good method for modeling intermediate representations from two modalities. However, existing contrastive learning methods in the audio field focus on extracting global descriptive information for downstream audio classification tasks,…
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
Methods: Focus · Contrastive Learning
