Channel-Aware Pretraining of Joint Encoder-Decoder Self-Supervised Model for Telephonic-Speech ASR

Vrunda N. Sukhadia; A. Arunkumar; S. Umesh

arXiv:2211.01669 · eess.AS · June 6, 2023

Open Access

TL;DR

This paper introduces a channel-aware pretraining method for joint encoder-decoder self-supervised models. It improves telephonic-speech ASR by encoding channel information through non-overlapping cluster IDs, giving a relative improvement of ~4% over a model pretrained on simply pooled data.

Contribution

It proposes a novel channel-aware clustering approach for joint encoder-decoder models, enhancing downstream ASR performance on multi-channel speech data.

Findings

1. Achieves a relative improvement of ~4% over the simple-pooling baseline.

2. Incorporating channel information during pretraining improves downstream ASR performance.

3. The method makes effective use of speech pooled from two channels (narrow and wide band) to improve ASR results.

Abstract

This paper proposes a novel technique to obtain better downstream ASR performance from a joint encoder-decoder self-supervised model when trained with speech pooled from two different channels (narrow and wide band). The joint encoder-decoder self-supervised model extends the HuBERT model with a Transformer decoder. HuBERT performs clustering of features and predicts the class of every input frame. In simple pooling, which is our baseline, there is no way to identify the channel information. To incorporate channel information, we have proposed non-overlapping cluster IDs for speech from different channels. Our method gives a relative improvement of ~4% over the joint encoder-decoder self-supervised model built with simple pooling of data, which serves as our baseline.
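The abstract does not say how the non-overlapping cluster IDs are built, so the following is only a minimal sketch of one possible reading: each channel keeps its own range of IDs, obtained by offsetting the per-frame cluster ID by the codebook size. The function name `channel_aware_targets`, the channel coding (0 for narrow band, 1 for wide band), and the codebook size are all assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def channel_aware_targets(cluster_ids, channel_ids, num_clusters):
    """Map per-frame cluster IDs to channel-specific, non-overlapping IDs.

    Hypothetical helper (not from the paper). Assumes:
      - cluster_ids: per-frame IDs in [0, num_clusters), e.g. from k-means
      - channel_ids: per-frame channel index, 0 = narrow band, 1 = wide band
    Frames from channel c are shifted into [c * num_clusters, (c + 1) * num_clusters).
    """
    cluster_ids = np.asarray(cluster_ids, dtype=np.int64)
    channel_ids = np.asarray(channel_ids, dtype=np.int64)
    if cluster_ids.shape != channel_ids.shape:
        raise ValueError("cluster_ids and channel_ids must have the same shape")
    if np.any(cluster_ids < 0) or np.any(cluster_ids >= num_clusters):
        raise ValueError("cluster_ids must lie in [0, num_clusters)")
    if np.any(channel_ids < 0):
        raise ValueError("channel_ids must be non-negative")
    # Offsetting by the codebook size keeps the ID ranges of the channels disjoint.
    return cluster_ids + channel_ids * num_clusters


# Toy example: 4 clusters, two frames per channel.
frame_clusters = [3, 0, 3, 1]
frame_channels = [0, 0, 1, 1]
targets = channel_aware_targets(frame_clusters, frame_channels, num_clusters=4)
```

With simple pooling, the first and third frames above would share the target 3. In this sketch they receive different targets, so the frame-level prediction task itself carries the channel identity, and the output layer needs one class per channel-specific ID (here 2 × 4 = 8).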

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Advanced Data Compression Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Softmax · Label Smoothing · Linear Layer · Adam · Position-Wise Feed-Forward Layer · Dense Connections · Absolute Position Encodings · Layer Normalization