Channel-Aware Pretraining of Joint Encoder-Decoder Self-Supervised Model for Telephonic-Speech ASR
Vrunda N. Sukhadia, A. Arunkumar, S. Umesh

TL;DR
This paper introduces a channel-aware pretraining method for joint encoder-decoder self-supervised models. It improves telephonic-speech ASR by assigning non-overlapping cluster IDs to speech from different channels (narrow and wide band), giving a relative improvement of about 4% over a simple-pooling baseline.
Contribution
It proposes a channel-aware clustering approach for joint encoder-decoder self-supervised models that improves downstream ASR performance when pretraining on speech pooled from narrowband and wideband channels.
Findings
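Achieves a relative improvement of ~4% over the simple-pooling baseline.
Making channel information available during pretraining, via non-overlapping cluster IDs, improves downstream ASR performance.
Simple pooling of narrowband and wideband speech gives the model no way to identify the channel; the proposed method removes this limitation.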
Abstract
This paper proposes a novel technique to obtain better downstream ASR performance from a joint encoder-decoder self-supervised model when trained with speech pooled from two different channels (narrow and wide band). The joint encoder-decoder self-supervised model extends the HuBERT model with a Transformer decoder. HuBERT performs clustering of features and predicts the class of every input frame. In simple pooling, which is our baseline, there is no way to identify the channel information. To incorporate channel information, we propose non-overlapping cluster IDs for speech from different channels. Our method gives a relative improvement of ~4% over the joint encoder-decoder self-supervised model built with simple pooling of data.
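One way to picture the non-overlapping cluster IDs described in the abstract is as a per-channel offset applied to frame-level cluster IDs. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the IDs are made disjoint, so the offset scheme, the function name `channel_aware_ids`, the channel names, and the cluster count `K` are all hypothetical.

```python
from typing import Dict, List

# Hypothetical channel names; the abstract only says "narrow and wide band".
CHANNEL_INDEX: Dict[str, int] = {"narrowband": 0, "wideband": 1}


def channel_aware_ids(cluster_ids: List[int], channel: str, num_clusters: int) -> List[int]:
    """Shift per-frame cluster IDs into a channel-specific, non-overlapping range.

    Assumption: each channel's frames were clustered into IDs in [0, num_clusters).
    """
    if any(c < 0 or c >= num_clusters for c in cluster_ids):
        raise ValueError("cluster IDs must lie in [0, num_clusters)")
    offset = CHANNEL_INDEX[channel] * num_clusters
    return [offset + c for c in cluster_ids]


# Toy frame-level cluster IDs (made-up values, not data from the paper).
K = 100
narrow_targets = channel_aware_ids([3, 3, 17, 99], "narrowband", K)
wide_targets = channel_aware_ids([3, 3, 17, 99], "wideband", K)

print(narrow_targets)  # [3, 3, 17, 99]
print(wide_targets)    # [103, 103, 117, 199]
```

Under this assumed scheme, the frame-level prediction targets span 2 × K classes, so a target ID by itself reveals which channel the frame came from, which plain pooling cannot express.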
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Advanced Data Compression Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Softmax · Label Smoothing · Linear Layer · Adam · Position-Wise Feed-Forward Layer · Dense Connections · Absolute Position Encodings · Layer Normalization
