CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations
Hang Li, Yu Kang, Tianqiao Liu, Wenbiao Ding, Zitao Liu

TL;DR
This paper introduces CTAL, a cross-modal transformer pre-training approach for audio and language that enhances performance on multiple downstream tasks by learning intra- and inter-modality connections.
Contribution
The paper presents a novel pre-training framework with a specialized fusion mechanism for audio-language tasks, improving over existing methods in generalization and performance.
Findings
Fine-tuning the pre-trained model yields significant improvements in emotion classification, sentiment analysis, and speaker verification.
The proposed fusion mechanism enhances fine-tuning performance.
Ablation studies confirm the effectiveness of the cross-modality pre-training and fusion components.
Abstract
Existing task-specific predictive approaches for audio-and-language tasks focus on building complicated late-fusion mechanisms. However, these models face overfitting when labels are limited and generalize poorly. In this paper, we present a Cross-modal Transformer for Audio-and-Language, i.e., CTAL, which aims to learn the intra-modality and inter-modality connections between audio and language through two proxy tasks on a large number of audio-and-language pairs: masked language modeling and masked cross-modal acoustic modeling. After fine-tuning our pre-trained model on multiple downstream audio-and-language tasks, we observe significant improvements across various tasks, such as emotion classification, sentiment analysis, and speaker verification. On this basis, we further propose a specially designed fusion mechanism that can be used in the fine-tuning phase, which…
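The two proxy tasks named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `MASK_ID` value, the 15% masking rate, zeroing of masked frames, and the L1 reconstruction loss are all assumptions chosen for illustration, and the transformer that would predict the masked content is omitted.

```python
import random

MASK_ID = 0  # hypothetical id of a [MASK] token (assumption)


def mask_tokens(token_ids, mask_prob=0.15, seed=0):
    """Masked language modeling input: replace random tokens with MASK_ID.

    Returns (masked_ids, labels); labels hold the original id at masked
    positions and None elsewhere, so only masked positions are scored.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in token_ids:
        if rng.random() < mask_prob:
            masked.append(MASK_ID)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


def mask_frames(frames, mask_prob=0.15, seed=0):
    """Masked acoustic modeling input: zero out random acoustic frames.

    Returns (masked_frames, flags), where flags[i] is True if frame i
    was masked and must be reconstructed by the model.
    """
    rng = random.Random(seed)
    masked, flags = [], []
    for frame in frames:
        hit = rng.random() < mask_prob
        masked.append([0.0] * len(frame) if hit else list(frame))
        flags.append(hit)
    return masked, flags


def l1_reconstruction_loss(predicted, target, flags):
    """Mean absolute error computed over masked frames only (assumed loss)."""
    total, count = 0.0, 0
    for p, t, hit in zip(predicted, target, flags):
        if hit:
            total += sum(abs(a - b) for a, b in zip(p, t))
            count += len(t)
    return total / count if count else 0.0
```

In a cross-modal setup, the model would see the masked tokens and masked frames together, so that text can help reconstruct audio and vice versa; the sketch only covers input corruption and scoring.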
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · CTAL · Adam · Dropout · Softmax · Residual Connection
