Hybrid Transducer and Attention based Encoder-Decoder Modeling for Speech-to-Text Tasks

Yun Tang; Anna Y. Sun; Hirofumi Inaguma; Xinyue Chen; Ning Dong; Xutai Ma; Paden D. Tomasello; Juan Pino

arXiv:2305.03101 · cs.CL · May 8, 2023 · 1 citation

TL;DR

This paper introduces a hybrid speech-to-text model combining Transducer and Attention Encoder-Decoder methods, leveraging their strengths for improved offline and streaming ASR and translation performance.

Contribution

The paper proposes a novel combined framework, TAED, sharing a speech encoder and integrating Transducer and AED components for enhanced speech-to-text tasks.

Findings

01. TAED outperforms Transducer in offline ASR and speech translation (ST) tasks.

02. In streaming scenarios, TAED surpasses Transducer in ASR and in one translation direction.

03. In streaming scenarios, TAED achieves results comparable to Transducer in another translation direction.

Abstract

Transducer and Attention based Encoder-Decoder (AED) are two widely used frameworks for speech-to-text tasks. They are designed for different purposes, and each has its own benefits and drawbacks. To leverage the strengths of both modeling methods, we propose combining Transducer and Attention based Encoder-Decoder (TAED) for speech-to-text tasks. The new method leverages AED's strength in non-monotonic sequence-to-sequence learning while retaining Transducer's streaming property. In the proposed framework, Transducer and AED share the same speech encoder. The predictor in Transducer is replaced by the decoder in the AED model, and the outputs of the decoder are conditioned on the speech inputs instead of outputs from an unconditioned language model. The proposed solution ensures that the model is optimized by covering all possible read/write…
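
The data flow described in the abstract (a shared speech encoder, an attention decoder standing in for the Transducer predictor, and decoder outputs that depend on the speech input) can be illustrated with a toy sketch. This is not the authors' implementation: the class `ToyTAED`, its method names, the layer sizes, the random weights, the single-layer attention, and the additive joiner are all assumptions chosen for brevity. The decoder here also looks only at the previous token and attends over all frames, which is a simplification the abstract does not specify.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(W, v):
    return [dot(row, v) for row in W]


def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class ToyTAED:
    """Hypothetical toy model; names, sizes, and weights are illustrative only."""

    def __init__(self, feat_dim=4, hid=6, vocab=5, seed=0):
        rng = random.Random(seed)
        self.vocab = vocab  # index 0 is used as the Transducer blank (assumption)
        self.W_enc = rand_matrix(rng, hid, feat_dim)  # shared speech encoder
        self.E_tok = rand_matrix(rng, vocab, hid)     # token embeddings
        self.W_join = rand_matrix(rng, vocab, hid)    # joiner output projection

    def encode(self, frames):
        # Shared encoder: one hidden vector per input frame.
        return [[math.tanh(x) for x in matvec(self.W_enc, f)] for f in frames]

    def decoder_state(self, enc, prev_token):
        # AED-style decoder step used in place of the Transducer predictor:
        # the previous token forms a query that attends over encoder outputs,
        # so the state depends on the speech input, not on text history alone.
        q = self.E_tok[prev_token]
        weights = softmax([dot(q, h) for h in enc])
        context = [sum(w * h[i] for w, h in zip(weights, enc))
                   for i in range(len(q))]
        return [math.tanh(a + c) for a, c in zip(q, context)]

    def lattice(self, frames, prefix):
        # Returns a T x (U+1) grid of distributions over blank + tokens,
        # the grid a Transducer loss would sum alignment paths over.
        enc = self.encode(frames)
        dec = [self.decoder_state(enc, tok) for tok in [0] + prefix]
        return [[softmax(matvec(self.W_join,
                                [math.tanh(e + d) for e, d in zip(h, s)]))
                 for s in dec]
                for h in enc]
```

Calling `ToyTAED().lattice(frames, prefix)` with `T` frames and a prefix of `U` tokens yields `T` rows of `U + 1` probability vectors. In a plain Transducer the predictor would see only the token history; in this sketch the same slot is filled by a state that also attends to the encoder outputs.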

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling