Letter-Based Speech Recognition with Gated ConvNets

Vitaliy Liptchinsky; Gabriel Synnaeve; Ronan Collobert

arXiv:1712.09444·cs.CL·February 19, 2019·34 cites

TL;DR

This paper introduces a letter-based speech recognition system using Gated ConvNets, achieving competitive results on WSJ and LibriSpeech benchmarks without relying on traditional phoneme-based models.

Contribution

It presents a novel ConvNet architecture with Gated Linear Units for letter-based speech recognition, demonstrating strong performance with CTC and ASG training methods.
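As a rough illustration of the CTC side of this contribution, the sketch below shows the standard CTC collapsing rule that maps a frame-level letter path to a transcription: merge consecutive repeats, then drop blanks. The blank symbol `_` and the function name `ctc_collapse` are assumptions made for this example and are not taken from the paper. ASG is not shown.

```python
from itertools import groupby

BLANK = "_"  # assumed blank symbol, chosen for illustration only


def ctc_collapse(path, blank=BLANK):
    """Map a frame-level CTC path to its letter sequence.

    Step 1: merge runs of identical consecutive symbols.
    Step 2: remove the blank symbol.
    """
    merged = [symbol for symbol, _ in groupby(path)]
    return "".join(symbol for symbol in merged if symbol != blank)


if __name__ == "__main__":
    # One letter-or-blank prediction per audio frame.
    print(ctc_collapse("hh_e_ll_lo"))
```

The blank between the two runs of `l` is what lets a doubled letter survive the merge step; without it, the repeated frames would collapse into a single letter.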

Findings

01. Matches best letter-based systems on WSJ

02. Achieves near state-of-the-art on LibriSpeech

03. Utilizes Gated ConvNets with high dropout

Abstract

In the recent literature, "end-to-end" speech systems often refer to letter-based acoustic models trained in a sequence-to-sequence manner, either via a recurrent model or via a structured output learning approach (such as CTC). In contrast to traditional phone (or senone)-based approaches, these "end-to-end" approaches alleviate the need for word pronunciation modeling, and do not require a "forced alignment" step at training time. Phone-based approaches, however, remain state of the art on classical benchmarks. In this paper, we propose a letter-based speech recognition system, leveraging a ConvNet acoustic model. Key ingredients of the ConvNet are Gated Linear Units and high dropout. The ConvNet is trained to map audio sequences to their corresponding letter transcriptions, either via a classical CTC approach, or via a recent variant called ASG. Coupled with a simple decoder at…
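The two ingredients named in the abstract can be sketched in a few lines. The example below is a minimal, dependency-free illustration of a Gated Linear Unit applied to one frame of channel activations (split the channels in half and gate the first half with the sigmoid of the second), followed by inverted dropout. The function names, the channel-split convention, and the dropout rate used in the demo are assumptions for illustration; they are not the paper's implementation or hyperparameters.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def glu(frame):
    """Gated Linear Unit on one frame of 2*d channel activations.

    The first d values are the linear part, the last d values are the
    gate inputs; the output has d values: a * sigmoid(b).
    """
    if len(frame) % 2 != 0:
        raise ValueError("GLU needs an even number of channels")
    half = len(frame) // 2
    return [a * sigmoid(b) for a, b in zip(frame[:half], frame[half:])]


def dropout(values, rate, rng, training=True):
    """Inverted dropout: zero each value with probability `rate` and
    rescale survivors by 1 / (1 - rate) so the expected value is kept."""
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


if __name__ == "__main__":
    rng = random.Random(0)
    gated = glu([1.0, -3.0, 0.0, 0.0])  # gates are sigmoid(0) = 0.5
    print(gated)
    print(dropout(gated, rate=0.5, rng=rng))  # rate chosen for the demo only
```

In a ConvNet, the 2*d activations would come from a convolution with twice as many output channels as the layer is meant to emit, so the gate is learned alongside the linear path.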

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing