Learning Speech Representation From Contrastive Token-Acoustic Pretraining

Chunyu Qiang; Hao Li; Yixin Tian; Ruibo Fu; Tao Wang; Longbiao Wang; Jianwu Dang

arXiv:2309.00424·eess.AS·December 19, 2023

Open Access

TL;DR

This paper introduces CTAP, a contrastive pretraining method that learns fine-grained speech representations by aligning phonemes and speech at the frame level, improving tasks like TTS, VC, and ASR.

Contribution

The paper presents a contrastive token-acoustic pretraining approach that models frame-level speech-phoneme relationships, addressing the excessive redundancy and dimension explosion of existing fine-grained speech representations.
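To make the idea of frame-level contrastive alignment concrete, here is a minimal sketch in NumPy. It is not the paper's implementation: it assumes a generic symmetric InfoNCE (CLIP-style) objective in which row t of the speech embeddings and row t of the phoneme embeddings describe the same frame. The function name `frame_contrastive_loss`, the temperature value, and the toy data are all hypothetical choices made for illustration.

```python
import numpy as np


def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def frame_contrastive_loss(speech_emb, phoneme_emb, temperature=0.07):
    """Symmetric InfoNCE over frames (an assumed, generic formulation).

    speech_emb, phoneme_emb: arrays of shape (T, D); row t of each is
    assumed to correspond to the same frame.
    """
    # L2-normalise so the dot product is a cosine similarity.
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    p = phoneme_emb / np.linalg.norm(phoneme_emb, axis=1, keepdims=True)
    logits = s @ p.T / temperature  # (T, T) frame-to-frame similarities
    idx = np.arange(logits.shape[0])
    # Each speech frame should pick out its own phoneme frame, and vice versa.
    loss_s2p = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_p2s = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_s2p + loss_p2s)


# Toy data (hypothetical): 8 frames, 16-dimensional embeddings.
rng = np.random.default_rng(0)
phoneme = rng.normal(size=(8, 16))
aligned = phoneme + 0.01 * rng.normal(size=(8, 16))  # near-perfect pairs
shuffled = aligned[::-1].copy()                      # every frame mismatched

loss_aligned = frame_contrastive_loss(aligned, phoneme)
loss_shuffled = frame_contrastive_loss(shuffled, phoneme)
print(loss_aligned < loss_shuffled)
```

In this toy setup the correctly aligned pairs should yield a lower loss than the reversed ones, which is the behaviour a frame-level contrastive objective is meant to reward. A real system would compute the embeddings with learned speech and phoneme encoders and handle length alignment between the two sequences, neither of which is shown here.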

Findings

01. Achieved minimally-supervised TTS, VC, and ASR with 210k speech-phoneme pairs

02. Demonstrated improved fine-grained speech representation learning

03. Provided audio samples on a dedicated website

Abstract

For fine-grained generation and recognition tasks such as minimally-supervised text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), the intermediate representations extracted from speech should serve as a "bridge" between text and acoustic information, containing information from both modalities. The semantic content is emphasized, while the paralinguistic information such as speaker identity and acoustic details should be de-emphasized. However, existing methods for extracting fine-grained intermediate representations from speech suffer from issues of excessive redundancy and dimension explosion. Contrastive learning is a good method for modeling intermediate representations from two modalities. However, existing contrastive learning methods in the audio field focus on extracting global descriptive information for downstream audio classification tasks,…

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: Focus · Contrastive Learning