SAN: a robust end-to-end ASR model architecture

Zeping Min; Qian Ge; Guanhua Huang

arXiv:2210.15285·cs.SD·October 28, 2022

TL;DR

This paper introduces SAN, a novel Siamese Adversarial Network architecture for robust end-to-end automatic speech recognition, significantly improving accuracy on fuzzy audio inputs and achieving state-of-the-art results.

Contribution

The paper presents a new SAN architecture that leverages adversarial learning to enhance acoustic feature extraction in ASR, especially for fuzzy audio, outperforming existing models.

Findings

01

Achieved a character error rate (CER) of 4.37 on AISHELL-1 without a language model

02

Reduced CER by around 5% relative on AISHELL-1

03

Demonstrated the model's effectiveness on phoneme recognition
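
For readers unfamiliar with the metric behind these findings, the following is a minimal sketch of how a character error rate and a relative reduction are commonly computed. It uses the standard Levenshtein edit distance; the function names and the example strings are illustrative assumptions and do not come from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference characters to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate in percent: edit distance over reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline, improved):
    """Relative reduction in percent of an error rate with respect to a baseline."""
    return 100.0 * (baseline - improved) / baseline


# Illustrative strings only, not data from the paper.
print(cer("abcd", "abxd"))
```

A "5% relative" reduction in this sense means the new error rate is about 0.95 times the baseline error rate, not that the error rate dropped by five percentage points.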

Abstract

In this paper, we propose a novel Siamese Adversarial Network (SAN) architecture for automatic speech recognition, which aims to address the difficulty of recognizing fuzzy audio. Specifically, SAN constructs two sub-networks to differentiate the audio feature input and then introduces a loss to unify the output distributions of these sub-networks. Adversarial learning enables the network to capture more essential acoustic features and helps the model perform better on fuzzy audio input. We conduct numerical experiments with the SAN model on several datasets for the automatic speech recognition task. All experimental results show that the siamese adversarial nets significantly reduce the character error rate (CER). Specifically, we achieve a new state-of-the-art CER of 4.37 without a language model on the AISHELL-1 dataset, which leads to around 5% relative CER…
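
The idea of two sub-networks whose output distributions are pulled together by a loss can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the abstract does not say how the two branches differ or which loss is used, so the sketch assumes the branches share weights and differ only by independent dropout masks, and it assumes a symmetric KL divergence as the unifying loss. All names (`siamese_consistency_loss`, `dropout`, `linear`) are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def symmetric_kl(p, q):
    """Average of KL(p || q) and KL(q || p). ASSUMPTION: stands in for the paper's loss."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


def dropout(features, rate, rng):
    """Inverted dropout. ASSUMPTION: this is how the two branches are made to differ."""
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in features]


def linear(features, weights):
    """Toy classifier head: one weight row per output class."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def siamese_consistency_loss(features, weights, rate=0.1, seed=0):
    """Run two weight-sharing branches with independent dropout and compare outputs."""
    rng = random.Random(seed)
    p = softmax(linear(dropout(features, rate, rng), weights))
    q = softmax(linear(dropout(features, rate, rng), weights))
    return symmetric_kl(p, q), p, q


# Illustrative values only.
features = [0.5, -1.2, 0.3, 2.0]
weights = [[0.1, 0.2, -0.3, 0.4], [-0.5, 0.1, 0.2, 0.0], [0.3, -0.2, 0.1, 0.1]]
loss, p, q = siamese_consistency_loss(features, weights, rate=0.5, seed=1)
print(loss)
```

In a training setup of this kind, such a term would typically be added to the ordinary recognition loss with a weighting coefficient, so that the model is penalized when perturbed views of the same audio yield different predictions. That coefficient is likewise an assumption here.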

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing