Tokenwise Contrastive Pretraining for Finer Speech-to-BERT Alignment in End-to-End Speech-to-Intent Systems

Vishal Sunder; Eric Fosler-Lussier; Samuel Thomas; Hong-Kwang J. Kuo; Brian Kingsbury

arXiv:2204.05188 · cs.CL · July 4, 2022

TL;DR

This paper presents a tokenwise contrastive pretraining method that aligns speech embeddings with BERT embeddings token by token, improving end-to-end speech-to-intent understanding, especially in noisy conditions.

Contribution

The paper introduces a tokenwise contrastive loss, paired with a cross-modal attention mechanism, that aligns speech-encoder embeddings with BERT contextual embeddings at the token level during SLU pretraining.

Findings

01 Achieves state-of-the-art intent recognition accuracy on two SLU datasets.

02 Improves robustness with SpecAugment, especially in noisy environments.

03 Demonstrates the effectiveness of token-level alignment over previous methods.

Abstract

Recent advances in End-to-End (E2E) Spoken Language Understanding (SLU) have been primarily due to effective pretraining of speech representations. One such pretraining paradigm is the distillation of semantic knowledge from state-of-the-art text-based models like BERT to speech encoder neural networks. This work is a step towards doing the same in a much more efficient and fine-grained manner where we align speech embeddings and BERT embeddings on a token-by-token basis. We introduce a simple yet novel technique that uses a cross-modal attention mechanism to extract token-level contextual embeddings from a speech encoder such that these can be directly compared and aligned with BERT based contextual embeddings. This alignment is performed using a novel tokenwise contrastive loss. Fine-tuning such a pretrained model to perform intent recognition using speech directly yields…
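The two ideas named in the abstract, cross-modal attention that pools speech frames into token-level embeddings and a tokenwise contrastive loss against BERT embeddings, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the use of token vectors as attention queries, the single-head unprojected attention, the symmetric InfoNCE-style form of the loss, and the temperature value are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_modal_attention(token_queries, speech_frames):
    """Pool speech frames into one embedding per text token.

    token_queries: (T, d) one query vector per token (assumed input).
    speech_frames: (F, d) speech encoder outputs, used as keys and values.
    Returns an array of shape (T, d).
    """
    d = token_queries.shape[-1]
    scores = token_queries @ speech_frames.T / np.sqrt(d)  # (T, F)
    weights = softmax(scores, axis=-1)                     # each row sums to 1
    return weights @ speech_frames                         # (T, d)


def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def tokenwise_contrastive_loss(speech_tokens, bert_tokens, temperature=0.1):
    """Symmetric InfoNCE-style loss over tokens (hypothetical form).

    Row i of speech_tokens is the positive for row i of bert_tokens;
    every other row acts as a negative.
    """
    s = l2_normalize(speech_tokens)
    b = l2_normalize(bert_tokens)
    logits = s @ b.T / temperature  # (T, T) scaled cosine similarities

    def diagonal_cross_entropy(l):
        l = l - l.max(axis=-1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=-1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the speech-to-text and text-to-speech directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(20, 16))   # 20 speech frames, dimension 16
    queries = rng.normal(size=(5, 16))   # 5 token queries
    bert = rng.normal(size=(5, 16))      # stand-in for BERT embeddings
    pooled = cross_modal_attention(queries, frames)
    print(pooled.shape, tokenwise_contrastive_loss(pooled, bert))
```

In this sketch the loss is small when each speech-derived token embedding is closest to its own BERT embedding and grows when the pairing is scrambled, which is the behavior a token-level alignment objective is meant to reward.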

Taxonomy

Topics: Topic Modeling · Speech Recognition and Synthesis · Speech and dialogue systems

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Dropout · WordPiece · Adam · Dense Connections · Attention Dropout