Blockwise Streaming Transformer for Spoken Language Understanding and Simultaneous Speech Translation

Keqi Deng; Shinji Watanabe; Jiatong Shi; Siddhant Arora

arXiv:2204.08920 · cs.CL · April 20, 2022

Open Access

TL;DR

This paper introduces a blockwise streaming Transformer for real-time spoken language understanding and speech translation, achieving competitive performance with novel techniques like intermediate loss regularization and cross-lingual encoding.

Contribution

It takes a first step toward streaming SLU and simultaneous speech translation with a blockwise streaming Transformer, adding an ASR-based intermediate loss regularization for SLU and a CTC-based cross-lingual encoding method for ST.

Findings

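01. The proposed methods yield a 2.4% accuracy gain on the SLU task over streaming baselines.

02. The proposed methods yield a 4.3 BLEU gain on the speech translation task over streaming baselines.

03. The blockwise streaming Transformer achieves results competitive with offline models.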

Abstract

Although Transformers have gained success in several speech processing tasks like spoken language understanding (SLU) and speech translation (ST), achieving online processing while keeping competitive performance is still essential for real-world interaction. In this paper, we take the first step on streaming SLU and simultaneous ST using a blockwise streaming Transformer, which is based on contextual block processing and blockwise synchronous beam search. Furthermore, we design an automatic speech recognition (ASR)-based intermediate loss regularization for the streaming SLU task to improve the classification performance further. As for the simultaneous ST task, we propose a cross-lingual encoding method, which employs a CTC branch optimized with target language translations. In addition, the CTC translation output is also used to refine the search space with CTC prefix score,…
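
The abstract does not spell out how blockwise processing works, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes the incoming frames are cut into overlapping blocks and that each block is encoded together with a context value carried over from the previous block. The names `split_into_blocks`, `stream_encode`, and `toy_encode_block`, the `block_size` and `hop_size` parameters, and the averaging "encoder" are all hypothetical and chosen purely for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def split_into_blocks(frames: Sequence[float], block_size: int, hop_size: int) -> List[List[float]]:
    """Cut a frame sequence into overlapping blocks, in arrival order."""
    if block_size <= 0 or hop_size <= 0:
        raise ValueError("block_size and hop_size must be positive")
    blocks: List[List[float]] = []
    start = 0
    while start < len(frames):
        blocks.append(list(frames[start:start + block_size]))
        if start + block_size >= len(frames):
            break  # the last block already reaches the end of the input
        start += hop_size
    return blocks


def stream_encode(
    frames: Sequence[float],
    block_size: int,
    hop_size: int,
    encode_block: Callable[[List[float], float], Tuple[float, float]],
    init_context: float = 0.0,
) -> List[float]:
    """Encode blocks one at a time, passing a context value forward."""
    context = init_context
    outputs: List[float] = []
    for block in split_into_blocks(frames, block_size, hop_size):
        # Each block sees only its own frames plus the inherited context,
        # so an output can be emitted before the utterance has ended.
        output, context = encode_block(block, context)
        outputs.append(output)
    return outputs


def toy_encode_block(block: List[float], context: float) -> Tuple[float, float]:
    """Hypothetical stand-in for a Transformer encoder block (assumption)."""
    summary = 0.5 * (sum(block) / len(block)) + 0.5 * context
    return summary, summary


if __name__ == "__main__":
    print(stream_encode(list(range(10)), block_size=4, hop_size=2,
                        encode_block=toy_encode_block))
```

In a real system the scalar frames would be acoustic feature vectors and `toy_encode_block` would be replaced by the actual encoder; the point of the sketch is only that later blocks depend on earlier ones through the carried context, never on future input.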

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems

Methods: Attention Is All You Need · Linear Layer · Label Smoothing · Adam · Multi-Head Attention · Residual Connection · Absolute Position Encodings · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Dense Connections