Blockwise Streaming Transformer for Spoken Language Understanding and Simultaneous Speech Translation
Keqi Deng, Shinji Watanabe, Jiatong Shi, Siddhant Arora

TL;DR
This paper introduces a blockwise streaming Transformer for streaming spoken language understanding (SLU) and simultaneous speech translation (ST). It achieves competitive performance using two proposed techniques: ASR-based intermediate loss regularization for SLU and cross-lingual encoding for ST.
Contribution
It takes a first step toward streaming SLU and simultaneous speech translation with a blockwise streaming Transformer, adding an ASR-based intermediate loss regularization for SLU and a cross-lingual encoding method for translation.
Findings
Achieves a 2.4% accuracy improvement on the SLU task.
Attains a 4.3 BLEU improvement on speech translation.
Performs comparably to offline models in streaming scenarios.
Abstract
Although Transformers have gained success in several speech processing tasks like spoken language understanding (SLU) and speech translation (ST), achieving online processing while keeping competitive performance is still essential for real-world interaction. In this paper, we take the first step on streaming SLU and simultaneous ST using a blockwise streaming Transformer, which is based on contextual block processing and blockwise synchronous beam search. Furthermore, we design an automatic speech recognition (ASR)-based intermediate loss regularization for the streaming SLU task to improve the classification performance further. As for the simultaneous ST task, we propose a cross-lingual encoding method, which employs a CTC branch optimized with target language translations. In addition, the CTC translation output is also used to refine the search space with CTC prefix score,…
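The block-by-block processing that the abstract refers to can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`split_into_blocks`, `mean_vector`, `stream_blocks`), the `block_size` and `hop_size` parameters, and the use of a simple mean as the context carried between blocks are all assumptions made for illustration. The paper's model would use a learned context representation inside the encoder rather than an average.

```python
from typing import Iterator, List, Optional, Tuple

Frame = List[float]  # one acoustic feature vector (hypothetical representation)


def split_into_blocks(frames: List[Frame], block_size: int, hop_size: int) -> List[List[Frame]]:
    """Cut a frame sequence into blocks; blocks overlap when hop_size < block_size."""
    if block_size <= 0 or hop_size <= 0:
        raise ValueError("block_size and hop_size must be positive")
    blocks = []
    start = 0
    while start < len(frames):
        blocks.append(frames[start:start + block_size])
        if start + block_size >= len(frames):
            break  # the last block already reaches the end of the input
        start += hop_size
    return blocks


def mean_vector(block: List[Frame]) -> Frame:
    """Stand-in for a learned context embedding: the per-dimension mean of a block."""
    dim = len(block[0])
    return [sum(frame[d] for frame in block) / len(block) for d in range(dim)]


def stream_blocks(
    frames: List[Frame], block_size: int, hop_size: int
) -> Iterator[Tuple[Optional[Frame], List[Frame]]]:
    """Yield (context, block) pairs; context summarises only the previous block."""
    prev_context: Optional[Frame] = None
    for block in split_into_blocks(frames, block_size, hop_size):
        yield prev_context, block
        prev_context = mean_vector(block)


if __name__ == "__main__":
    toy_frames = [[float(i)] for i in range(10)]
    for context, block in stream_blocks(toy_frames, block_size=4, hop_size=2):
        print(context, block)
```

Each block can be processed as soon as its frames arrive, and the only information it receives about the past is the context handed over from the previous block. This is the property that makes online operation possible in a sketch like this one.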
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems
Methods: Attention Is All You Need · Linear Layer · Label Smoothing · Adam · Multi-Head Attention · Residual Connection · Absolute Position Encodings · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Dense Connections
