fairseq S2T: Fast Speech-to-Text Modeling with fairseq

Changhan Wang; Yun Tang; Xutai Ma; Anne Wu; Sravya Popuri; Dmytro Okhonko; Juan Pino

arXiv:2010.05171·cs.CL·June 15, 2022·92 cites

TL;DR

fairseq S2T is a scalable, extensible fairseq extension for speech-to-text tasks such as end-to-end speech recognition and speech-to-text translation. It supports several model architectures and integrates fairseq's machine translation and language models into end-to-end workflows.

Contribution

The paper introduces fairseq S2T, a fairseq extension for speech-to-text modeling that implements RNN-based, Transformer-based, and Conformer-based models and integrates fairseq's machine translation and language models.

Findings

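01 Supports RNN-based, Transformer-based, and Conformer-based models

02 Provides open-source training recipes and end-to-end workflows, from data pre-processing through model training to inference

03 Enables multi-task and transfer learning by integrating fairseq's machine translation and language models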

Abstract

We introduce fairseq S2T, a fairseq extension for speech-to-text (S2T) modeling tasks such as end-to-end speech recognition and speech-to-text translation. It follows fairseq's careful design for scalability and extensibility. We provide end-to-end workflows from data pre-processing and model training to offline (online) inference. We implement state-of-the-art RNN-based, Transformer-based, and Conformer-based models and open-source detailed training recipes. fairseq's machine translation models and language models can be seamlessly integrated into S2T workflows for multi-task learning or transfer learning. fairseq S2T documentation and examples are available at https://github.com/pytorch/fairseq/tree/master/examples/speech_to_text.

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis