RealTranS: End-to-End Simultaneous Speech Translation with Convolutional Weighted-Shrinking Transformer

Xingshan Zeng; Liangyou Li; Qun Liu

arXiv:2106.04833 · cs.CL · June 10, 2021

TL;DR

RealTranS is an end-to-end model for simultaneous speech translation. It bridges the speech and text modalities by gradually downsampling the input speech and shrinking it into text-like units, with the aim of improving translation quality under real-time latency constraints.

Contribution

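It introduces an architecture that interleaves convolution and unidirectional Transformer layers for acoustic modeling, a weighted-shrinking operation followed by a semantic encoder to map speech features into text space, and two techniques for the simultaneous setting: a blank penalty that improves shrinking quality and a Wait-K-Stride-N decoding strategy that allows local reranking.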
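
To make the shrinking idea concrete, here is a minimal sketch of one plausible reading of a weighted-shrinking step. It is not the authors' implementation: the summary above does not give the formula, so the grouping rule (merge consecutive frames that share the same most likely CTC label, drop blank runs) and the weights (each frame's non-blank probability) are assumptions, as are the function name `weighted_shrink` and its arguments.

```python
from itertools import groupby


def weighted_shrink(frames, ctc_probs, blank=0):
    """Shrink frame-level features into segment-level features.

    frames:    list of feature vectors (lists of floats), one per frame.
    ctc_probs: list of per-frame probability distributions over labels.
    blank:     index of the CTC blank label (assumed to be 0 here).

    Assumption: consecutive frames with the same argmax label form one
    segment, blank segments are dropped, and each remaining segment is the
    average of its frames weighted by their non-blank probability.
    """
    # Most likely label for every frame.
    labels = [max(range(len(p)), key=p.__getitem__) for p in ctc_probs]
    dim = len(frames[0])
    shrunk = []
    start = 0
    for label, run in groupby(labels):
        length = len(list(run))
        span = range(start, start + length)
        start += length
        if label == blank:
            continue  # blank runs carry no token, so they are skipped
        weights = [1.0 - ctc_probs[t][blank] for t in span]
        total = sum(weights)
        shrunk.append([
            sum(w * frames[t][d] for w, t in zip(weights, span)) / total
            for d in range(dim)
        ])
    return shrunk


if __name__ == "__main__":
    frames = [[1.0], [3.0], [10.0], [5.0]]
    ctc_probs = [
        [0.2, 0.8, 0.0],  # label 1
        [0.4, 0.6, 0.0],  # label 1 (same segment as the previous frame)
        [0.9, 0.1, 0.0],  # blank, dropped
        [0.1, 0.0, 0.9],  # label 2
    ]
    print(weighted_shrink(frames, ctc_probs))
```

In this toy input four frames shrink to two segment vectors, which is the sense in which the operation moves a long speech sequence toward text-length units.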

Findings

01  Outperforms prior end-to-end models in diverse latency settings.

02  Effective in real-time translation with improved accuracy.

03  Demonstrates robustness across multiple datasets.

Abstract

End-to-end simultaneous speech translation (SST), which directly translates speech in one language into text in another language in real time, is useful in many scenarios but has not been fully investigated. In this work, we propose RealTranS, an end-to-end model for SST. To bridge the modality gap between speech and text, RealTranS gradually downsamples the input speech with interleaved convolution and unidirectional Transformer layers for acoustic modeling, and then maps speech features into text space with a weighted-shrinking operation and a semantic encoder. In addition, to improve model performance in simultaneous scenarios, we propose a blank penalty to enhance the shrinking quality and a Wait-K-Stride-N strategy to allow local reranking during decoding. Experiments on public, widely used datasets show that RealTranS with the Wait-K-Stride-N strategy outperforms prior…
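
The Wait-K-Stride-N strategy named in the abstract can be illustrated with a small read/write schedule. This is a sketch under stated assumptions, not the paper's decoder: it assumes the policy first reads K source units, then alternates between writing N target tokens and reading N more source units, and it takes the target length as given rather than predicting it. The function name `wait_k_stride_n` and its parameters are hypothetical.

```python
def wait_k_stride_n(k, n, src_len, tgt_len):
    """Return a list of "READ"/"WRITE" actions for an assumed Wait-K-Stride-N policy.

    k:       number of source units read before the first write.
    n:       stride, i.e. tokens written and units read per step.
    src_len: total number of source units available.
    tgt_len: number of target tokens to emit (known here for illustration).
    """
    actions = []
    read = written = 0

    # Initial wait: read up to k source units before writing anything.
    while read < min(k, src_len):
        actions.append("READ")
        read += 1

    # Alternate: write n tokens, then read n more source units.
    while written < tgt_len:
        for _ in range(n):
            if written < tgt_len:
                actions.append("WRITE")
                written += 1
        for _ in range(n):
            if read < src_len:
                actions.append("READ")
                read += 1
    return actions


if __name__ == "__main__":
    print(" ".join(a[0] for a in wait_k_stride_n(k=2, n=2, src_len=5, tgt_len=6)))
```

With N = 1 this schedule reduces to an ordinary wait-k policy. Writing N tokens per step is what leaves room for reranking candidates locally, for example with a beam search over those N tokens, before they are committed.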

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Label Smoothing · Convolution · Residual Connection