SCaLa: Supervised Contrastive Learning for End-to-End Speech Recognition

Li Fu; Xiaoxiao Li; Runyu Wang; Lu Fan; Zhengchen Zhang; Meng Chen; Youzheng Wu; Xiaodong He

arXiv:2110.04187 · eess.AS · June 22, 2022

Open Access

TL;DR

This paper introduces SCaLa, a supervised contrastive learning framework that improves phonemic representation in end-to-end speech recognition, reducing errors by leveraging phoneme boundaries for better feature learning.

Contribution

The paper extends self-supervised Masked Contrastive Predictive Coding (MCPC) to a fully supervised setting, using phoneme boundaries from forced alignment to improve phonemic discrimination in end-to-end ASR models.

Findings

01

Achieved character error rate (CER) reductions of 2.8 and 1.4 points on reading and spontaneous speech datasets, respectively.

02

Used phoneme forced-alignment to guide contrastive learning, improving recognition accuracy.

03

Demonstrated the effectiveness of supervised contrastive learning in end-to-end ASR systems.
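
The first finding reports CER reductions in points. As a reminder of what that unit means, here is a minimal sketch of how CER is commonly computed (character-level edit distance divided by reference length) and how a reduction in points differs from a relative reduction. The strings and resulting numbers are toy values chosen for illustration and are not taken from the paper; reading "points" as an absolute difference is an assumption here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# Toy strings (hypothetical, not from the paper).
ref = "abcdefghij"
baseline_hyp = "abxdefgyij"   # two substitutions -> CER 20.0
improved_hyp = "abcdefgyij"   # one substitution  -> CER 10.0

baseline_cer = cer(ref, baseline_hyp)
improved_cer = cer(ref, improved_hyp)

# A reduction "in points" is the plain difference between the two CER values.
absolute_reduction = baseline_cer - improved_cer               # 10.0 points
# A relative reduction divides that difference by the baseline CER.
relative_reduction = 100.0 * absolute_reduction / baseline_cer  # 50.0 percent
```

Under this reading, a 2.8-point reduction means the CER value itself drops by 2.8, regardless of how large the baseline CER is.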

Abstract

End-to-end Automatic Speech Recognition (ASR) models are usually trained to optimize the loss of the whole token sequence, while neglecting explicit phonemic-granularity supervision. This could result in recognition errors due to similar-phoneme confusion or phoneme reduction. To alleviate this problem, we propose a novel framework based on Supervised Contrastive Learning (SCaLa) to enhance phonemic representation learning for end-to-end ASR systems. Specifically, we extend the self-supervised Masked Contrastive Predictive Coding (MCPC) to a fully-supervised setting, where the supervision is applied in the following way. First, SCaLa masks variable-length encoder features according to phoneme boundaries given phoneme forced-alignment extracted from a pre-trained acoustic model; it then predicts the masked features via contrastive learning. The forced-alignment can provide phoneme labels…
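
The two steps described in the abstract (masking encoder features along phoneme boundaries, then predicting the masked features contrastively) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the alignment format `(start_frame, end_frame, phoneme)`, the `mask_prob` parameter, the cosine-similarity InfoNCE loss with a `temperature`, and the helper names are all hypothetical. In particular, drawing negatives only from frames with a different phoneme label is one plausible use of the alignment labels and is assumed here, since the abstract is truncated at that point.

```python
import math
import random


def mask_phoneme_segments(num_frames, segments, mask_prob, rng):
    """Pick whole phoneme segments to mask; return the masked frame indices."""
    masked = []
    for start, end, _label in segments:       # end is exclusive
        if rng.random() < mask_prob:
            masked.extend(range(start, min(end, num_frames)))
    return masked


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def info_nce(pred, positive, negatives, temperature=0.1):
    """InfoNCE loss: negative log-softmax of the positive among all candidates."""
    logits = [cosine(pred, positive) / temperature]
    logits += [cosine(pred, n) / temperature for n in negatives]
    m = max(logits)                            # subtract max for numerical stability
    log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denom - logits[0]


def pick_negatives(features, frame_labels, target_idx):
    """Assumed strategy: negatives are frames whose phoneme label differs from the target's."""
    target_label = frame_labels[target_idx]
    return [f for f, lab in zip(features, frame_labels) if lab != target_label]


# Toy forced alignment and encoder features (hypothetical values).
segments = [(0, 2, "a"), (2, 5, "b"), (5, 6, "c")]
frame_labels = [lab for s, e, lab in segments for _ in range(e - s)]
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0],
            [0.1, 0.9], [0.0, 1.0], [-1.0, 0.0]]

rng = random.Random(0)
masked = mask_phoneme_segments(len(features), segments, mask_prob=0.5, rng=rng)

# Stand-in for a model's prediction at masked frame 0; a real system would
# produce this from the unmasked context rather than hard-code it.
pred = [1.0, 0.0]
loss = info_nce(pred, features[0], pick_negatives(features, frame_labels, 0))
```

Masking whole segments rather than fixed-length spans is what makes the masks variable-length: each masked span is exactly as long as the aligned phoneme it covers.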

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Contrastive Learning · InfoNCE · Contrastive Predictive Coding