Contrastive Siamese Network for Semi-supervised Speech Recognition

Soheil Khorram; Jaeyoung Kim; Anshuman Tripathi; Han Lu; Qian Zhang; Hasim Sak

arXiv:2205.14054·cs.LG·May 30, 2022

Open Access

TL;DR

This paper presents a contrastive siamese network architecture that leverages unlabeled speech data to improve speech recognition accuracy, achieving a 20% relative word error rate (WER) improvement over wav2vec baselines and competitive results with fewer parameters than state-of-the-art models.

Contribution

The paper introduces the first contrastive siamese network for speech recognition that extracts high-level linguistic features from unlabeled data using novel training strategies.

Findings

1. 20% relative WER improvement over wav2vec baselines
2. Competitive results with fewer parameters than state-of-the-art models
3. Effective use of unlabeled data in semi-supervised speech recognition
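
A relative improvement is measured against the baseline's own error rate, not in absolute percentage points. The sketch below shows the arithmetic; the function name and the WER values are hypothetical and chosen only for illustration, not taken from the paper.

```python
def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    """Return the fraction by which new_wer is lower than baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers, NOT results from the paper: a drop from 5.0% to 4.0% WER
# is 1.0 point in absolute terms but 20% in relative terms.
improvement = relative_wer_improvement(5.0, 4.0)
print(f"{improvement:.0%}")
```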

Abstract

This paper introduces the contrastive siamese (c-siam) network, an architecture for leveraging unlabeled acoustic data in speech recognition. c-siam is the first network that extracts high-level linguistic information from speech by matching the outputs of two identical transformer encoders. It contains an augmented branch and a target branch, which are trained by (1) masking inputs and matching outputs with a contrastive loss, (2) applying a stop-gradient operation on the target branch, (3) using an extra learnable transformation on the augmented branch, and (4) introducing new temporal augmentation functions to prevent the shortcut learning problem. We use the Libri-light 60k unsupervised data and the LibriSpeech 100-hour/960-hour supervised data to compare c-siam with other best-performing systems. Our experiments show that c-siam provides a 20% relative word error rate improvement over wav2vec baselines. A…
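
As a rough illustration of steps (1) and (2) in the abstract, the sketch below computes a contrastive loss between the two branches' outputs at masked frames. The abstract does not give the exact loss, so several things here are assumptions: the InfoNCE-style form, cosine similarity, the `temperature` value, and the use of the other frames of the same utterance as negatives. The names `cosine` and `contrastive_loss` are hypothetical. Pure Python has no automatic differentiation, so the stop gradient is only represented by treating the target outputs as constants; in an autodiff framework they would be wrapped in that framework's stop-gradient operation. The learnable transformation and the temporal augmentation functions are not modeled.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def contrastive_loss(aug_out, tgt_out, masked, temperature=0.1):
    """InfoNCE-style loss averaged over masked frames (assumed form).

    aug_out: augmented-branch outputs, one vector per frame
    tgt_out: target-branch outputs, treated as constants (stand-in for stop gradient)
    masked:  indices of frames whose input was masked
    """
    total = 0.0
    for t in masked:
        # Similarity of the augmented output at frame t to every target frame.
        logits = [cosine(aug_out[t], tgt) / temperature for tgt in tgt_out]
        # Numerically stable log-sum-exp over all candidate frames.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability of picking the matching frame t.
        total += log_denominator - logits[t]
    return total / len(masked)


# Toy data: four frames with orthogonal one-hot "features".
target = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
matched = [row[:] for row in target]   # augmented branch agrees with the target
shifted = target[1:] + target[:1]      # augmented branch matches the wrong frames

loss_matched = contrastive_loss(matched, target, masked=[1, 3])
loss_shifted = contrastive_loss(shifted, target, masked=[1, 3])
print(loss_matched < loss_shifted)  # prints True
```

The loss is near zero when the augmented branch reproduces the target output at each masked frame, and large when it matches a different frame, which is the behavior a contrastive objective of this kind rewards.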

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing