ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation

Chenyang Le; Yao Qian; Long Zhou; Shujie Liu; Yanmin Qian; Michael Zeng; Xuedong Huang

arXiv:2305.14838 · cs.CL · October 17, 2023 · 2 cites

TL;DR

ComSL is a novel composite speech-language model that efficiently combines pretrained speech and language models through multi-task learning, achieving state-of-the-art results in multilingual speech-to-text translation.

Contribution

It introduces a composite architecture with cross-modality transfer learning for end-to-end speech translation, reducing data and computational demands.

Findings

01

Achieved a new state-of-the-art average BLEU score of 31.5 on the public CoVoST2 evaluation set, for speech-to-English-text translation across 21 languages.

02

Integrates public pretrained speech-only and language-only models in one composite architecture for multilingual speech-to-text translation.

03

Trains data-efficiently by running cross-modality learning and transfer learning simultaneously in a multi-task setup.

Abstract

Joint speech-language training is challenging due to the large demand for training data and GPU consumption, as well as the modality gap between speech and language. We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models and optimized data-efficiently for spoken language tasks. Particularly, we propose to incorporate cross-modality learning into transfer learning and conduct them simultaneously for downstream tasks in a multi-task learning manner. Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks, achieving a new state-of-the-art average BLEU score of 31.5 on the multilingual speech to English text translation task for 21 languages, as measured on the public CoVoST2 evaluation set.
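
The multi-task training mentioned in the abstract can be pictured as a weighted sum of per-task losses. The sketch below is a minimal illustration under that assumption; the function name, task names, loss values, and weights are all hypothetical and are not taken from the paper.

```python
from typing import Dict


def combine_task_losses(losses: Dict[str, float], weights: Dict[str, float]) -> float:
    """Return the weighted sum of per-task losses.

    A weighted sum is one common way to optimize several tasks jointly;
    it is assumed here for illustration, not confirmed by the source.
    """
    missing = set(losses) - set(weights)
    if missing:
        raise KeyError(f"no weight given for tasks: {sorted(missing)}")
    return sum(weights[task] * value for task, value in losses.items())


# Hypothetical per-batch losses and weights (illustrative values only).
losses = {
    "speech_translation": 2.0,
    "speech_recognition": 1.0,
    "text_translation": 0.5,
    "cross_modality": 0.25,
}
weights = {
    "speech_translation": 1.0,
    "speech_recognition": 0.5,
    "text_translation": 0.5,
    "cross_modality": 2.0,
}

total = combine_task_losses(losses, weights)
print(total)  # 2.0 + 0.5 + 0.25 + 0.5 = 3.25
```

In a real training loop each entry would be a differentiable loss tensor rather than a float, and the total would be backpropagated through the shared model once per batch.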

Code & Models

Repositories

nethermanpro/comsl · PyTorch · Official

Videos

ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling