VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing

Chunyu Qiang; Wang Geng; Yi Zhao; Ruibo Fu; Tao Wang; Cheng Gong; Tianrui Wang; Qiuyu Liu; Jiangyan Yi; Zhengqi Wen; Chen Zhang; Hao Che; Longbiao Wang; Jianwu Dang; Jianhua Tao

arXiv:2408.05758 · eess.AS · May 29, 2025

TL;DR

VQ-CTAP is a cross-modal sequence learning framework that aligns text and speech at the frame level, so it can be applied to speech processing tasks such as voice conversion and speech recognition without additional fine-tuning.

Contribution

The paper presents VQ-CTAP, a novel paradigm for fine-grained cross-modal sequence representation learning that integrates multiple pre-trained modules and introduces a semantic-transfer loss.

Findings

01. Can be applied directly to VC and ASR without fine-tuning

02. Achieves high-compression speech coding at 25 Hz from 24 kHz input

03. Demonstrates improved generalization to unseen data
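
The compression figure in the second finding can be checked with simple arithmetic. The sketch below assumes that 25 Hz is the rate of the coded sequence and 24 kHz is the waveform sampling rate; the function name `samples_per_frame` is hypothetical and not taken from the paper.

```python
def samples_per_frame(sample_rate_hz: int, frame_rate_hz: int) -> int:
    """Return how many waveform samples one coded frame spans.

    Assumption: the sampling rate is an exact multiple of the frame rate.
    """
    if frame_rate_hz <= 0 or sample_rate_hz % frame_rate_hz != 0:
        raise ValueError("sample rate must be a positive multiple of frame rate")
    return sample_rate_hz // frame_rate_hz


# 24 kHz waveform coded at 25 Hz: each frame covers 960 samples (40 ms).
print(samples_per_frame(24_000, 25))  # 960
```

Under these assumptions, one second of audio is represented by 25 frames instead of 24,000 samples, a 960-fold reduction in sequence length.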

Abstract

Deep learning has brought significant improvements to the field of cross-modal representation learning. For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired, emphasizing the semantic content of the text modality while de-emphasizing the paralinguistic information of the speech modality. We propose a method called "Vector Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP)", which uses the cross-modal aligned sequence transcoder to bring text and speech into a joint multimodal space, learning how to connect text and speech at the frame level. The proposed VQ-CTAP is a paradigm for cross-modal sequence representation learning, offering a promising solution for fine-grained generation and recognition tasks in speech processing. The VQ-CTAP can be directly…
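
To make the idea of connecting text and speech at the frame level more concrete, here is a minimal sketch of a generic symmetric contrastive (InfoNCE-style) loss over aligned frame pairs. It is not the paper's loss: the function names, the temperature of 0.07, and the assumption that frame `i` of the text sequence is paired with frame `i` of the speech sequence are all illustrative choices.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def _normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def _dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def _diagonal_cross_entropy(logits: List[List[float]]) -> float:
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def frame_contrastive_loss(
    text_frames: Sequence[Vector],
    speech_frames: Sequence[Vector],
    temperature: float = 0.07,  # assumed value, not from the paper
) -> float:
    """Symmetric contrastive loss over frame-aligned text/speech embeddings."""
    if not text_frames or len(text_frames) != len(speech_frames):
        raise ValueError("need two non-empty sequences of equal length")
    t = [_normalize(v) for v in text_frames]
    s = [_normalize(v) for v in speech_frames]
    # logits[i][j]: similarity of text frame i to speech frame j
    logits = [[_dot(ti, sj) / temperature for sj in s] for ti in t]
    logits_t = [list(col) for col in zip(*logits)]  # speech-to-text direction
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits_t))


# Toy example: three orthogonal frames, correctly aligned vs. shifted by one.
frames = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shifted = frames[1:] + frames[:1]
print(frame_contrastive_loss(frames, frames) < frame_contrastive_loss(frames, shifted))  # True
```

The loss is small when each text frame is most similar to its own speech frame and large when the pairing is wrong, which is the behavior a frame-level alignment objective needs.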

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems