SCaLa: Supervised Contrastive Learning for End-to-End Speech Recognition
Li Fu, Xiaoxiao Li, Runyu Wang, Lu Fan, Zhengchen Zhang, Meng Chen, Youzheng Wu, Xiaodong He

TL;DR
This paper introduces SCaLa, a supervised contrastive learning framework that improves phonemic representation in end-to-end speech recognition, reducing errors by leveraging phoneme boundaries for better feature learning.
Contribution
The paper extends self-supervised contrastive coding to a supervised setting using phoneme boundaries, enhancing phonemic discrimination in ASR models.
Findings
Achieved CER reductions of 2.8 and 1.4 points on reading and spontaneous speech datasets, respectively.
Utilized phoneme forced-alignment to guide contrastive learning, improving recognition accuracy.
Demonstrated effectiveness of supervised contrastive learning in end-to-end ASR systems.
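For readers unfamiliar with the metric, the following is a minimal sketch of how a Character Error Rate (CER) is commonly computed: the character-level edit distance between reference and hypothesis, divided by the reference length. The function names `edit_distance` and `cer` are illustrative and not taken from the paper, and reading "points" as an absolute difference between two such percentages is an assumption of this sketch.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference characters to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """CER in percent; assumes a non-empty reference string."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)
```

Under that reading, a reduction of 2.8 points would mean that `cer` for the baseline minus `cer` for the improved system, averaged over a test set, equals 2.8.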
Abstract
End-to-end Automatic Speech Recognition (ASR) models are usually trained to optimize the loss of the whole token sequence, while neglecting explicit phonemic-granularity supervision. This could result in recognition errors due to similar-phoneme confusion or phoneme reduction. To alleviate this problem, we propose a novel framework based on Supervised Contrastive Learning (SCaLa) to enhance phonemic representation learning for end-to-end ASR systems. Specifically, we extend the self-supervised Masked Contrastive Predictive Coding (MCPC) to a fully-supervised setting, where the supervision is applied in the following way. First, SCaLa masks variable-length encoder features according to phoneme boundaries given phoneme forced-alignment extracted from a pre-trained acoustic model; it then predicts the masked features via contrastive learning. The forced-alignment can provide phoneme labels…
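The two steps described in the abstract (masking encoder features along phoneme boundaries, then predicting them contrastively) can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: the `(start, end)` segment format, the per-segment masking probability, the cosine similarity, the temperature value, and the choice of negatives are all assumptions made for the sketch. How SCaLa itself selects positives and negatives is not stated in the excerpt above.

```python
import math
import random


def mask_by_phoneme(num_frames, segments, mask_prob, rng):
    """Mask whole phoneme segments, so masked spans have variable length.

    segments: list of (start, end) frame indices, end exclusive
    (a hypothetical format for a forced alignment).
    """
    mask = [False] * num_frames
    for start, end in segments:
        if rng.random() < mask_prob:  # decide per segment, not per frame
            for t in range(start, end):
                mask[t] = True
    return mask


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def info_nce(pred, positive, negatives, temperature=0.1):
    """InfoNCE loss for one masked frame: -log softmax score of the positive."""
    logits = [cosine(pred, positive) / temperature]
    logits += [cosine(pred, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[0]


# Usage with a made-up alignment of three phonemes over nine frames.
rng = random.Random(0)
segments = [(0, 3), (3, 5), (5, 9)]
mask = mask_by_phoneme(9, segments, mask_prob=0.5, rng=rng)
```

Masking per segment rather than per frame is what ties the mask length to the phoneme duration. The loss is small when the prediction is closer to the positive than to any negative, and grows when a negative scores higher.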
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Contrastive Learning · InfoNCE · Contrastive Predictive Coding
