Non-Autoregressive ASR with Self-Conditioned Folded Encoders

Tatsuya Komatsu

arXiv:2202.08474 · eess.AS · February 18, 2022

TL;DR

This paper introduces a parameter-efficient non-autoregressive speech recognition model built on self-conditioned folded encoders. With fewer parameters, it matches or exceeds the performance of conventional models.

Contribution

It proposes a novel folded encoder architecture with self-conditioning and CTC loss, reducing parameters while maintaining or improving recognition accuracy.

Findings

01. Achieves performance comparable to conventional models with 38% of their parameters.

02. Outperforms conventional models when the number of iterations is increased.

03. Demonstrates effective parameter reduction without sacrificing accuracy.
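The parameter saving in these findings comes from reusing the folded layers: parameters grow with the number of distinct layers, while depth grows with the number of iterations. The sketch below shows that arithmetic only. The function names, the equal-size-layer assumption, and the layer counts in the example are hypothetical and are not the paper's configuration.

```python
def encoder_param_ratio(n_base, n_folded, n_conventional,
                        params_per_layer=1.0, shared_params=0.0):
    """Folded-model parameters divided by conventional-stack parameters.

    Assumption: every encoder layer has the same parameter count.
    `shared_params` stands for parts present in both models
    (for example a front end and an output layer).
    """
    folded = (n_base + n_folded) * params_per_layer + shared_params
    conventional = n_conventional * params_per_layer + shared_params
    return folded / conventional


def effective_depth(n_base, n_folded, n_iterations):
    """Number of layer applications in one forward pass of the folded model."""
    return n_base + n_folded * n_iterations


# Hypothetical layer counts, chosen only to illustrate the trade-off.
print(encoder_param_ratio(n_base=2, n_folded=4, n_conventional=18))
print(effective_depth(n_base=2, n_folded=4, n_iterations=4))
```

Raising `n_iterations` increases the effective depth without changing the ratio, which is why iterations can be increased at no parameter cost.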

Abstract

This paper proposes CTC-based non-autoregressive ASR with self-conditioned folded encoders. The proposed method realizes non-autoregressive ASR with fewer parameters by folding the conventional stack of encoders into only two blocks: base encoders and folded encoders. The base encoders convert the input audio features into a neural representation suitable for recognition. This is followed by the folded encoders applied repeatedly for further refinement. Applying the CTC loss to the outputs of all encoders enforces the consistency of the input-output relationship. Thus, folded encoders learn to perform the same operations as an encoder with deeper distinct layers. In experiments, we investigate how to set the number of layers and the number of iterations for the base and folded encoders. The results show that the proposed method achieves a performance comparable to that of the…
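The control flow described in the abstract can be sketched roughly as follows: base layers run once, the same folded layers run repeatedly, and a prediction is taken after each block so that a loss can be applied to every output. This is a minimal sketch and not the author's implementation. The names `folded_forward`, `ctc_head`, `feedback`, and `mean_loss` are hypothetical, layers are plain callables on a list of floats, the self-conditioning step (adding a projection of the previous prediction to the next input) is an assumption about how the conditioning is wired, and the unweighted average of the losses is also an assumption.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def folded_forward(x: Vector,
                   base_layers: Sequence[Layer],
                   folded_layers: Sequence[Layer],
                   n_iterations: int,
                   ctc_head: Layer,
                   feedback: Layer) -> List[Vector]:
    """Return the predictions taken after the base block and after each iteration."""
    h = x
    for layer in base_layers:          # base encoders: applied once
        h = layer(h)
    outputs = [ctc_head(h)]

    for _ in range(n_iterations):      # folded encoders: same layers, reused
        # Assumed self-conditioning: feed the previous prediction back in.
        cond = feedback(outputs[-1])
        h = [a + b for a, b in zip(h, cond)]
        for layer in folded_layers:
            h = layer(h)
        outputs.append(ctc_head(h))
    return outputs


def mean_loss(outputs: Sequence[Vector],
              loss_fn: Callable[[Vector], float]) -> float:
    """Average a per-output loss over all collected outputs (assumed weighting)."""
    return sum(loss_fn(o) for o in outputs) / len(outputs)


# Toy stand-ins for real encoder layers, for illustration only.
toy_outputs = folded_forward(
    x=[1.0, 2.0],
    base_layers=[lambda h: [v + 1.0 for v in h]],
    folded_layers=[lambda h: [v * 2.0 for v in h]],
    n_iterations=2,
    ctc_head=lambda h: list(h),
    feedback=lambda p: [0.0] * len(p),
)
print(len(toy_outputs))
```

In a real model, `loss_fn` would be a CTC loss against the reference transcript and the layers would be Transformer-style encoder blocks. The toy callables here only exercise the loop structure.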

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing

Methods: Connectionist Temporal Classification Loss · Balanced Selection