CALM: Contrastive Aligned Audio-Language Multirate and Multimodal Representations
Vin Sachidananda, Shao-Yen Tseng, Erik Marchi, Sachin Kajarekar, Panayiotis Georgiou

TL;DR
CALM introduces a contrastive, multirate approach to learn aligned audio and language representations, achieving competitive performance in emotion recognition with efficient training.
Contribution
The paper presents CALM, a novel method for aligning audio and lexical representations using contrastive learning and multirate processing within a pretrained language model.
Findings
Reports a 10-25% improvement over existing emotion recognition systems.
Aligns audio embeddings with pretrained language embeddings in only a few hours of training.
Demonstrates benefits of multirate pretraining for multimodal tasks.
Abstract
Deriving multimodal representations of audio and lexical inputs is a central problem in Natural Language Understanding (NLU). In this paper, we present Contrastive Aligned Audio-Language Multirate and Multimodal Representations (CALM), an approach for learning multimodal representations using contrastive and multirate information inherent in audio and lexical inputs. The proposed model aligns acoustic and lexical information in the input embedding space of a pretrained language-only contextual embedding model. By aligning audio representations to pretrained language representations and utilizing contrastive information between acoustic inputs, CALM is able to bootstrap audio embedding competitive with existing audio representation models in only a few hours of training time. Operationally, audio spectrograms are processed using linearized patches through a Spectral Transformer…
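The two operations the abstract names, cutting a spectrogram into linearized patches and aligning audio with text through a contrastive objective, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, patch sizes, and temperature are hypothetical, and a generic symmetric InfoNCE loss stands in for whatever objective the paper actually uses.

```python
import numpy as np


def linearize_patches(spec, patch_f, patch_t):
    """Split a (freq, time) spectrogram into non-overlapping patches
    and flatten each patch into one row vector."""
    n_f, n_t = spec.shape[0] // patch_f, spec.shape[1] // patch_t
    # Drop any remainder that does not fill a whole patch (an assumption).
    spec = spec[: n_f * patch_f, : n_t * patch_t]
    patches = spec.reshape(n_f, patch_f, n_t, patch_t).transpose(0, 2, 1, 3)
    return patches.reshape(n_f * n_t, patch_f * patch_t)


def info_nce(audio_emb, text_emb, temperature=0.1):
    """Symmetric InfoNCE loss. Row i of each matrix is assumed to be a
    matched audio/text pair; all other rows in the batch act as negatives."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature  # (batch, batch) scaled cosine similarities

    def cross_entropy_diag(z):
        z = z - z.max(axis=1, keepdims=True)  # numerical stability
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the audio-to-text and text-to-audio directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))
```

In a full model the patch rows would pass through a learned projection and a transformer encoder before the loss is computed; here the loss simply drops when row i of the audio matrix is closer to row i of the text matrix than to any other row.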
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Absolute Position Encodings · Softmax · Byte Pair Encoding · Layer Normalization · Dropout · Label Smoothing
