ESPnet-ST IWSLT 2021 Offline Speech Translation System

Hirofumi Inaguma; Brian Yan; Siddharth Dalmia; Pengcheng Guo; Jiatong Shi; Kevin Duh; Shinji Watanabe

arXiv:2107.00636 · eess.AS · July 7, 2021 · 1 citation

TL;DR

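This paper presents the ESPnet-ST group's offline speech translation system for IWSLT 2021, which combines multi-referenced sequence-level knowledge distillation, a Conformer encoder with the Multi-Decoder architecture, and improved audio segmentation, reaching 31.4 BLEU on the tst2021 2-reference test set.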

Contribution

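It applies multi-referenced sequence-level knowledge distillation (SeqKD) from multiple teachers, adopts a Conformer encoder with the Multi-Decoder architecture, and improves audio segmentation by using the pyannote.audio toolkit and merging multiple short segments.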
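
As a rough illustration of the multi-referenced SeqKD idea, the sketch below pairs each utterance with one translation per teacher, so the student sees several target references for the same audio. It is a minimal sketch under stated assumptions, not the submission's pipeline: the function name `build_multi_ref_seqkd`, the dictionary keys `audio`, `transcript`, and `translation`, and the `keep_original` option are all hypothetical, and each teacher is modeled as a plain function from source text to translated text.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical teacher interface: maps a source transcript to a translation.
Teacher = Callable[[str], str]


def build_multi_ref_seqkd(
    corpus: List[Dict[str, str]],
    teachers: List[Teacher],
    keep_original: bool = True,
) -> List[Tuple[str, str]]:
    """Return (audio, target) pairs with one target per distinct reference.

    Assumption: keeping the original human translation next to the teacher
    outputs is optional here; the source text does not say which is used.
    """
    examples: List[Tuple[str, str]] = []
    for item in corpus:
        targets: List[str] = []
        if keep_original:
            targets.append(item["translation"])
        # One distilled reference per teacher model.
        targets.extend(teacher(item["transcript"]) for teacher in teachers)
        # Drop duplicate references while preserving order.
        seen = set()
        for target in targets:
            if target not in seen:
                seen.add(target)
                examples.append((item["audio"], target))
    return examples
```

In practice the teachers would be text translation models trained on different amounts of bitext; here any callable works, which keeps the sketch independent of a particular toolkit.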

Findings

01. Achieved 31.4 BLEU on the tst2021 2-reference test set.

02. Changes to training data, architecture, and audio segmentation yielded significant improvements.

03. Model ensembling further improved translation performance.

Abstract

This paper describes the ESPnet-ST group's IWSLT 2021 submission in the offline speech translation track. This year we made various efforts on training data, architecture, and audio segmentation. On the data side, we investigated sequence-level knowledge distillation (SeqKD) for end-to-end (E2E) speech translation. Specifically, we used multi-referenced SeqKD from multiple teachers trained on different amounts of bitext. On the architecture side, we adopted the Conformer encoder and the Multi-Decoder architecture, which equips dedicated decoders for speech recognition and translation tasks in a unified encoder-decoder model and enables search in both source and target language spaces during inference. We also significantly improved audio segmentation by using the pyannote.audio toolkit and merging multiple short segments for long context modeling. Experimental evaluations showed that…
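
The idea of merging multiple short segments for longer context can be sketched as follows. This is a minimal illustration, not the submission's actual procedure: the greedy strategy, the function name `merge_segments`, and the `max_duration` and `max_gap` thresholds are all assumptions, and the input is taken to be a list of (start, end) times in seconds such as a voice activity detector might produce.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)


def merge_segments(
    segments: List[Segment],
    max_duration: float = 20.0,  # assumed cap on a merged segment's length
    max_gap: float = 1.0,        # assumed largest silence bridged by a merge
) -> List[Segment]:
    """Greedily merge neighbouring segments into longer ones."""
    merged: List[Segment] = []
    for start, end in sorted(segments):
        if merged:
            prev_start, prev_end = merged[-1]
            gap = start - prev_end
            # Merge only if the pause is short and the result stays in budget.
            if gap <= max_gap and end - prev_start <= max_duration:
                merged[-1] = (prev_start, max(prev_end, end))
                continue
        merged.append((start, end))
    return merged
```

For example, `merge_segments([(0.0, 2.0), (2.5, 4.0), (10.0, 12.0)])` joins the first two segments, which are separated by a 0.5 s pause, and leaves the third on its own because the 6 s gap exceeds `max_gap`.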

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing

Methods: Knowledge Distillation