Efficient Sequence Transduction by Jointly Predicting Tokens and Durations
Hainan Xu, Fei Jia, Somshubra Majumdar, He Huang, Shinji Watanabe, Boris Ginsburg

TL;DR
This paper presents the Token-and-Duration Transducer (TDT), a novel architecture that jointly predicts tokens and their durations, enabling faster and more accurate sequence transduction in speech recognition, translation, and intent classification.
Contribution
The paper introduces TDT, a new model that jointly predicts tokens and durations, improving speed and accuracy over traditional transducers in various sequence tasks.
Findings
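TDT achieves up to 2.82X faster inference than conventional Transducers in speech recognition.
TDT improves BLEU scores by over 1 point in speech translation compared with conventional Transducers.
TDT improves intent accuracy by over 1% in intent classification while also running faster than conventional Transducers.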
Abstract
This paper introduces a novel Token-and-Duration Transducer (TDT) architecture for sequence-to-sequence tasks. TDT extends conventional RNN-Transducer architectures by jointly predicting both a token and its duration, i.e. the number of input frames covered by the emitted token. This is achieved by using a joint network with two outputs which are independently normalized to generate distributions over tokens and durations. During inference, TDT models can skip input frames guided by the predicted duration output, which makes them significantly faster than conventional Transducers which process the encoder output frame by frame. TDT models achieve both better accuracy and significantly faster inference than conventional Transducers on different sequence transduction tasks. TDT models for Speech Recognition achieve better accuracy and up to 2.82X faster inference than conventional…
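The two mechanisms the abstract describes, a joint output split into independently normalized token and duration distributions, and inference that skips frames by the predicted duration, can be illustrated with a small sketch. This is not the authors' implementation: the names `BLANK`, `split_and_normalize`, `greedy_decode`, `toy_joint`, and `max_symbols` are hypothetical, the joint network is replaced by a toy function, and the rule that forces an advance of one frame when the decoder would otherwise stay in place is an assumption added here to guarantee termination.

```python
import math
from typing import Callable, List, Sequence, Tuple

BLANK = 0  # hypothetical blank token id (assumption for this sketch)


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def split_and_normalize(joint_out: Sequence[float], vocab_size: int):
    """Split the joint output in two and normalize each part independently."""
    token_probs = softmax(joint_out[:vocab_size])
    duration_probs = softmax(joint_out[vocab_size:])
    return token_probs, duration_probs


def greedy_decode(
    num_frames: int,
    joint_fn: Callable[[int, List[int]], Sequence[float]],
    vocab_size: int,
    durations: Sequence[int],
    max_symbols: int = 5,
) -> Tuple[List[int], int]:
    """Greedy decoding that advances by the predicted duration.

    Returns the hypothesis and the number of joint evaluations.
    """
    t, hyp, joint_calls, stuck = 0, [], 0, 0
    while t < num_frames:
        token_probs, duration_probs = split_and_normalize(joint_fn(t, hyp), vocab_size)
        joint_calls += 1
        token = max(range(vocab_size), key=token_probs.__getitem__)
        skip = durations[max(range(len(durations)), key=duration_probs.__getitem__)]
        if token != BLANK:
            hyp.append(token)
        stuck = stuck + 1 if skip == 0 else 0
        # Assumption: force progress on blank with duration 0, or after
        # too many consecutive zero-duration emissions on one frame.
        if skip == 0 and (token == BLANK or stuck >= max_symbols):
            skip, stuck = 1, 0
        t += skip  # frames in between are never scored
    return hyp, joint_calls


def toy_joint(t: int, hyp: List[int]) -> List[float]:
    """Toy stand-in for a joint network: 3 token logits + 4 duration logits."""
    tok = [0.0, 0.0, 0.0]
    tok[1 if t == 0 else 2] = 5.0
    dur = [0.0, 0.0, 0.0, 5.0]  # always prefers the last duration option
    return tok + dur


print(greedy_decode(8, toy_joint, vocab_size=3, durations=[0, 1, 2, 4]))
# → ([1, 2], 2)
```

In this toy run the decoder covers 8 frames with 2 joint evaluations, whereas a frame-by-frame decoder would need at least 8. The numbers are purely illustrative and say nothing about the speedups reported in the paper.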
Code & Models
- 🤗 nvidia/parakeet-tdt-0.6b-v3 · model · 254k dl · ♡ 747
- 🤗 nvidia/parakeet-tdt-0.6b-v2 · model · 164k dl · ♡ 1444
- 🤗 nvidia/parakeet-tdt-1.1b · model · 11k dl · ♡ 114
- 🤗 nvidia/parakeet-tdt_ctc-0.6b-ja · model · 11k dl · ♡ 49
- 🤗 nvidia/parakeet-tdt_ctc-110m · model · 6.0k dl · ♡ 40
- 🤗 nvidia/parakeet-tdt_ctc-1.1b · model · 6.2k dl · ♡ 22
- 🤗 SoSolaris/parakeet-tdt-0.6b-v3 · model · 7 dl
- 🤗 ManuelZnnmc/parakeet-tdt-0.6b-v3 · model · 1 dl
- 🤗 MadnessOverflow/parakeet-tdt-0.6b-v3-bpe-vocab · model
- 🤗 Endy2001/parakeet-tdt-0.6b-v3 · model · 3 dl
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems
Methods: Test
