Efficient Sequence Transduction by Jointly Predicting Tokens and Durations

Hainan Xu; Fei Jia; Somshubra Majumdar; He Huang; Shinji Watanabe; Boris Ginsburg

arXiv:2304.06795 · eess.AS · May 31, 2023 · 6 cites

TL;DR

This paper presents the Token-and-Duration Transducer (TDT), an architecture that jointly predicts each token and its duration, enabling faster and more accurate sequence transduction in speech recognition, speech translation, and intent classification.

Contribution

The paper introduces TDT, which extends the conventional RNN-Transducer with a second joint-network output that predicts each token's duration, improving both speed and accuracy over conventional Transducers across several sequence transduction tasks.

Findings

1. TDT achieves up to 2.82X faster inference than conventional Transducers in speech recognition.

2. TDT improves BLEU scores by over 1 point in speech translation.

3. TDT improves intent accuracy by over 1% in intent classification while also running faster.

Abstract

This paper introduces a novel Token-and-Duration Transducer (TDT) architecture for sequence-to-sequence tasks. TDT extends conventional RNN-Transducer architectures by jointly predicting both a token and its duration, i.e. the number of input frames covered by the emitted token. This is achieved by using a joint network with two outputs which are independently normalized to generate distributions over tokens and durations. During inference, TDT models can skip input frames guided by the predicted duration output, which makes them significantly faster than conventional Transducers which process the encoder output frame by frame. TDT models achieve both better accuracy and significantly faster inference than conventional Transducers on different sequence transduction tasks. TDT models for Speech Recognition achieve better accuracy and up to 2.82X faster inference than conventional…
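
The two ideas in the abstract, a joint output that is split and normalized independently, and inference that jumps ahead by the predicted duration, can be illustrated with a small Python sketch. This is not the authors' implementation. The names split_and_normalize, greedy_decode, and toy_joint, the layout of the logits (token logits first, duration logits after), the duration set [0, 1, 2, 3], and the safeguards against stalling on one frame are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def split_and_normalize(joint_logits, num_tokens):
    """Split one joint output vector into token and duration distributions.

    Assumption: the first num_tokens entries are token logits and the
    remainder are duration logits. Each part gets its own softmax.
    """
    token_probs = softmax(joint_logits[:num_tokens])
    duration_probs = softmax(joint_logits[num_tokens:])
    return token_probs, duration_probs

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

def greedy_decode(joint_fn, num_frames, num_tokens, durations,
                  blank=0, max_symbols=5):
    """Greedy decoding that advances by the predicted duration."""
    t, hyp, steps, stalled = 0, [], 0, 0
    while t < num_frames:
        logits = joint_fn(t, hyp)
        token_probs, duration_probs = split_and_normalize(logits, num_tokens)
        token = argmax(token_probs)
        skip = durations[argmax(duration_probs)]
        steps += 1
        if token != blank:
            hyp.append(token)
        # Assumption of this sketch: a blank never keeps us on the same
        # frame, and emissions per frame are capped so the loop terminates.
        stalled = stalled + 1 if skip == 0 else 0
        if skip == 0 and (token == blank or stalled >= max_symbols):
            skip, stalled = 1, 0
        t += skip
    return hyp, steps

# Hypothetical stand-in for encoder + predictor + joint network.
DURATIONS = [0, 1, 2, 3]
NUM_TOKENS = 3  # blank (0) plus two labels

def toy_joint(t, hyp):
    script = {0: (1, 2), 2: (2, 3), 5: (0, 1)}  # frame -> (token, duration)
    token, skip = script.get(t, (0, 1))
    logits = [0.0] * (NUM_TOKENS + len(DURATIONS))
    logits[token] = 5.0
    logits[NUM_TOKENS + DURATIONS.index(skip)] = 5.0
    return logits

hyp, steps = greedy_decode(toy_joint, 6, NUM_TOKENS, DURATIONS)
print(hyp, steps)  # [1, 2] 3
```

In this toy run the decoder covers six frames with three joint-network calls, visiting only frames 0, 2, and 5. A frame-by-frame decoder would have to evaluate the joint network at every frame.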

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems
