# Sequence-to-Sequence Speech Recognition with Time-Depth Separable Convolutions

**Authors:** Awni Hannun, Ann Lee, Qiantong Xu, Ronan Collobert

arXiv: 1904.02619 · 2019-04-05

## TL;DR

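This paper introduces a fully convolutional sequence-to-sequence speech recognition encoder built on time-depth separable convolutions. The block sharply reduces the parameter count while keeping the receptive field large, and a stable beam search procedure lets the model integrate a language model effectively. On LibriSpeech, the model improves WER while being an order of magnitude more efficient than a strong RNN baseline.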

## Contribution

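The paper presents a fully convolutional encoder built from time-depth separable convolution blocks, paired with a simple decoder and a stable, efficient beam search inference procedure. The resulting model improves WER on LibriSpeech over a strong RNN baseline while being an order of magnitude more efficient.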

## Key findings

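- With a convolutional language model, more than 22% relative WER improvement over the best previously reported sequence-to-sequence results on the noisy LibriSpeech test set.
- The model is an order of magnitude more efficient than a strong RNN baseline.
- A stable and efficient beam search inference procedure makes language model integration effective.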
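Relative WER reduction, the metric behind the 22% figure, is the drop in word error rate divided by the baseline's word error rate. The sketch below shows the arithmetic only; the function name and the example numbers are hypothetical and are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the fractional WER reduction of new_wer relative to baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical values chosen for easy arithmetic, not taken from the paper.
reduction = relative_wer_reduction(baseline_wer=10.0, new_wer=7.5)
print(f"{reduction:.1%}")
```

A relative reduction is larger than the absolute difference in percentage points whenever the baseline WER is below 100%, which is why the two should not be confused when comparing systems.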

## Abstract

We propose a fully convolutional sequence-to-sequence encoder architecture with a simple and efficient decoder. Our model improves WER on LibriSpeech while being an order of magnitude more efficient than a strong RNN baseline. Key to our approach is a time-depth separable convolution block which dramatically reduces the number of parameters in the model while keeping the receptive field large. We also give a stable and efficient beam search inference procedure which allows us to effectively integrate a language model. Coupled with a convolutional language model, our time-depth separable convolution architecture improves by more than 22% relative WER over the best previously reported sequence-to-sequence results on the noisy LibriSpeech test set.
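The abstract's claim that the block reduces parameters while keeping the receptive field large can be illustrated with a rough weight count. This is a minimal sketch under an assumption that the abstract does not spell out: the input is treated as `w` feature positions with `c` channels each, and the time convolution uses a `k x 1` kernel that mixes only the `c` channels and is shared across the `w` positions. The function names and the sizes below are hypothetical, biases are ignored, and any other components of the actual block are not modeled; the full paper defines the real architecture.

```python
def full_conv1d_weights(k: int, w: int, c: int) -> int:
    """Weights in a 1D time convolution that mixes all w * c input features."""
    features = w * c
    return k * features * features


def time_separable_conv_weights(k: int, c: int) -> int:
    """Weights in a k x 1 time convolution with c input and c output channels.

    Assumed to be shared across the w feature positions, so w does not appear.
    """
    return k * c * c


def receptive_field(k: int, num_layers: int) -> int:
    """Receptive field in frames for stacked stride-1, undilated convolutions."""
    return num_layers * (k - 1) + 1


# Hypothetical sizes for illustration only.
k, w, c = 21, 80, 10
full = full_conv1d_weights(k, w, c)
separable = time_separable_conv_weights(k, c)
print(full, separable, full // separable)
print(receptive_field(k, num_layers=3))
```

Under these assumptions the weight count of the time convolution shrinks by a factor of `w * w`, while the receptive field depends only on `k` and the number of layers, so the kernel can stay wide at little parameter cost.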

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1904.02619/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/1904.02619/full.md

## References

38 references — full list in the complete paper: https://tomesphere.com/paper/1904.02619/full.md

---
Source: https://tomesphere.com/paper/1904.02619