JoeyS2T: Minimalistic Speech-to-Text Modeling with JoeyNMT
Mayumi Ohta, Julia Kreutzer, Stefan Riezler

TL;DR
JoeyS2T is a minimalist, easy-to-use speech-to-text toolkit built on JoeyNMT, offering competitive performance with a simple, integrated workflow for speech recognition and translation tasks.
Contribution
It extends JoeyNMT with speech-specific components, creating a unified, accessible toolkit for speech-to-text modeling that maintains simplicity and competitive accuracy.
Findings
Performs competitively on English speech recognition and English-to-German speech translation benchmarks.
Provides an integrated, easy-to-use pipeline from data to evaluation.
Maintains simplicity while including key speech modeling features.
Abstract
JoeyS2T is a JoeyNMT extension for speech-to-text tasks such as automatic speech recognition and end-to-end speech translation. It inherits the core philosophy of JoeyNMT, a minimalist NMT toolkit built on PyTorch, seeking simplicity and accessibility. JoeyS2T's workflow is self-contained, running from data pre-processing through model training and prediction to evaluation, and is seamlessly integrated into JoeyNMT's compact and simple code base. On top of JoeyNMT's state-of-the-art Transformer-based encoder-decoder architecture, JoeyS2T provides speech-oriented components such as convolutional layers, SpecAugment, CTC loss, and WER evaluation. Despite its simplicity compared to prior implementations, JoeyS2T performs competitively on English speech recognition and English-to-German speech translation benchmarks. The implementation is accompanied by a walk-through tutorial and available…
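The abstract lists WER (word error rate) evaluation among the speech-oriented components. As a rough illustration of what that metric measures, here is a minimal sketch in plain Python that computes corpus-level WER as the word-level edit distance divided by the number of reference words. The function names `edit_distance` and `wer` are hypothetical and chosen for this example. This is not JoeyS2T's own implementation, and it assumes whitespace tokenization with no text normalization.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    # prev[j] holds the distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1]


def wer(references: List[str], hypotheses: List[str]) -> float:
    """Corpus-level WER: total word errors divided by total reference words."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_errors = 0
    total_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words = ref.split()  # assumption: whitespace tokenization
        total_errors += edit_distance(ref_words, hyp.split())
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references contain no words")
    return total_errors / total_words


if __name__ == "__main__":
    refs = ["the cat sat on the mat"]
    hyps = ["the cat sit on mat"]
    # One substitution (sat -> sit) and one deletion (the) over six reference words.
    print(f"WER: {wer(refs, hyps):.3f}")
```

Because insertions count as errors, WER can exceed 1.0 when a hypothesis is much longer than its reference. Published WER figures also usually depend on normalization choices such as casing and punctuation handling, which this sketch leaves out.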
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling
