CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations

Hang Li; Yu Kang; Tianqiao Liu; Wenbiao Ding; Zitao Liu

arXiv:2109.00181 · cs.SD · September 2, 2021 · 5 cites

TL;DR

This paper introduces CTAL, a cross-modal transformer pre-training approach for audio and language that enhances performance on multiple downstream tasks by learning intra- and inter-modality connections.

Contribution

The paper presents a novel pre-training framework with a specialized fusion mechanism for audio-language tasks, improving over existing methods in generalization and performance.

Findings

01 Significant improvements in emotion classification, sentiment analysis, and speaker verification.

02 The proposed fusion mechanism enhances fine-tuning performance.

03 Ablation studies confirm the effectiveness of the cross-modality pre-training and fusion components.

Abstract

Existing task-specific predictive approaches for audio-and-language tasks focus on building complicated late-fusion mechanisms. However, these models tend to overfit when labels are limited and generalize poorly. In this paper, we present a Cross-modal Transformer for Audio-and-Language, i.e., CTAL, which aims to learn the intra-modality and inter-modality connections between audio and language through two proxy tasks on a large number of audio-and-language pairs: masked language modeling and masked cross-modal acoustic modeling. After fine-tuning our pre-trained model on multiple downstream audio-and-language tasks, we observe significant improvements across various tasks, such as emotion classification, sentiment analysis, and speaker verification. On this basis, we further propose a specially designed fusion mechanism that can be used in the fine-tuning phase, which…
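
The two proxy tasks named in the abstract can be illustrated with a minimal sketch of the input-masking step. This is not the authors' implementation: the function names, the `[MASK]` placeholder, the 15% masking rate, zero-filling of masked frames, and the L1 reconstruction loss are all assumptions chosen for illustration. The model itself is omitted; in a cross-modal setup it would predict the hidden tokens and frames while attending to both modalities.

```python
import random

MASK_TOKEN = "[MASK]"  # hypothetical placeholder, not taken from the paper


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Masked language modeling input: hide a random subset of tokens.

    Returns (masked_tokens, targets); targets[i] holds the original token
    at masked positions and None everywhere else.
    """
    if rng is None:
        rng = random.Random()
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            targets.append(tok)
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets


def mask_frames(frames, mask_prob=0.15, rng=None):
    """Masked acoustic modeling input: zero out a random subset of frames.

    Returns (masked_frames, mask); mask[i] is True where frame i was hidden.
    """
    if rng is None:
        rng = random.Random()
    masked, mask = [], []
    for frame in frames:
        hide = rng.random() < mask_prob
        masked.append([0.0] * len(frame) if hide else list(frame))
        mask.append(hide)
    return masked, mask


def reconstruction_l1(pred, target, mask):
    """Mean absolute error computed over the masked frames only (assumed loss)."""
    total, count = 0.0, 0
    for p, t, m in zip(pred, target, mask):
        if m:
            total += sum(abs(a - b) for a, b in zip(p, t))
            count += len(t)
    return total / count if count else 0.0
```

In use, a transcript and its paired acoustic frames would each be masked independently, fed jointly to the model, and scored only at the masked positions, so that recovering a hidden frame can draw on the unmasked text and vice versa.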

Code & Models

Repositories

ydkwim/ctal (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · CTAL · Adam · Dropout · Softmax · Residual Connection