Contrastive Siamese Network for Semi-supervised Speech Recognition
Soheil Khorram, Jaeyoung Kim, Anshuman Tripathi, Han Lu, Qian Zhang, Hasim Sak

TL;DR
This paper presents the contrastive siamese (c-siam) network, an architecture that uses unlabeled speech data to improve speech recognition. It reports a 20% relative word error rate (WER) improvement over wav2vec baselines and competitive results with fewer parameters than state-of-the-art models.
Contribution
The paper introduces the first contrastive siamese network for speech recognition that extracts high-level linguistic features from unlabeled data using novel training strategies.
Findings
20% relative WER improvement over wav2vec baselines
Competitive results with fewer parameters than state-of-the-art models
Effective use of unlabeled data in semi-supervised speech recognition
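To make the "relative WER improvement" figure concrete, here is a minimal sketch of how such a number is usually computed. The function name and the example WER values are hypothetical illustrations, not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which WER drops relative to the baseline.

    Both arguments are word error rates in the same unit (e.g. percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers, chosen only to illustrate the arithmetic:
# a baseline at 5.0% WER and a new system at 4.0% WER.
reduction = relative_wer_reduction(5.0, 4.0)
print(f"{reduction:.0%} relative WER reduction")
```

Note that a relative figure says nothing about the absolute gap: in this made-up case a 20% relative reduction is only one absolute WER point.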
Abstract
This paper introduces contrastive siamese (c-siam) network, an architecture for leveraging unlabeled acoustic data in speech recognition. c-siam is the first network that extracts high-level linguistic information from speech by matching outputs of two identical transformer encoders. It contains augmented and target branches which are trained by: (1) masking inputs and matching outputs with a contrastive loss, (2) incorporating a stop gradient operation on the target branch, (3) using an extra learnable transformation on the augmented branch, (4) introducing new temporal augment functions to prevent the shortcut learning problem. We use the Libri-light 60k unsupervised data and the LibriSpeech 100hrs/960hrs supervised data to compare c-siam and other best-performing systems. Our experiments show that c-siam provides 20% relative word error rate improvement over wav2vec baselines. A…
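The abstract's first training ingredient, matching the outputs of the two branches at masked positions with a contrastive loss, can be sketched roughly as follows. This is an illustration under assumptions, not the authors' implementation: the InfoNCE-style form, cosine similarity, the temperature value, and all names are choices made here, since the abstract does not specify them. The sketch has no gradients, so the stop gradient on the target branch appears only as the convention that `tgt_out` is treated as a constant, and the extra learnable transformation is assumed to have been applied to `aug_out` already.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def contrastive_loss(aug_out, tgt_out, masked_idx, temperature=0.1):
    """Mean InfoNCE-style loss over masked frames (assumed form).

    aug_out: per-frame outputs of the augmented branch (list of vectors).
    tgt_out: per-frame outputs of the target branch, treated as constants.
    masked_idx: indices of frames whose inputs were masked.
    For frame t, the positive is tgt_out[t]; every other frame is a negative.
    """
    total = 0.0
    for t in masked_idx:
        logits = [cosine(aug_out[t], tgt) / temperature for tgt in tgt_out]
        # Numerically stable log-sum-exp over all candidate frames.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - logits[t]
    return total / len(masked_idx)


# Toy example with three frames of 2-dimensional outputs.
targets = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]      # matches targets frame by frame
misaligned = [[0.0, 1.0], [-1.0, 0.0], [1.0, 0.0]]   # shifted by one frame

loss_aligned = contrastive_loss(aligned, targets, [0, 1, 2])
loss_misaligned = contrastive_loss(misaligned, targets, [0, 1, 2])
print(loss_aligned < loss_misaligned)
```

The loss is small when each augmented-branch output is closest to the target output at the same frame and large when it is closer to another frame, which is the matching behavior the abstract describes.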
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
