Tuning Large language model for End-to-end Speech Translation

Hao Zhang; Nianwen Si; Yaqi Chen; Wenlin Zhang; Xukui Yang; Dan Qu; Xiaolin Jiao

arXiv:2310.02050·cs.CL·October 4, 2023·2 cites

Open Access

TL;DR

This paper presents LST, a large multimodal model optimized for end-to-end speech translation. Trained in two stages, it achieves state-of-the-art BLEU scores on the MuST-C benchmark.

Contribution

Introduces LST, a novel multimodal model with a specialized training strategy for improved speech translation performance.

Findings

1. LST-13B surpasses previous models on the MuST-C benchmark.

2. Two-stage training effectively aligns speech and text representations.

3. LST achieves state-of-the-art BLEU scores for multiple language pairs.

Abstract

With the emergence of large language models (LLMs), multimodal models based on LLMs have demonstrated significant potential. Models such as LLaSM, X-LLM, and SpeechGPT exhibit an impressive ability to comprehend and generate human instructions. However, their performance often falters on complex tasks such as end-to-end speech translation (E2E-ST), a cross-language and cross-modal translation task, where multimodal models lag behind single-modal models. This paper introduces LST, a Large multimodal model designed to excel at the E2E-ST task. LST consists of a speech frontend, an adapter, and an LLM backend. The training of LST consists of two stages: (1) Modality adjustment, where the adapter is tuned to align speech representation with text embedding space, and (2) Downstream task fine-tuning, where both the adapter and the LLM are trained to…
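The two-stage schedule described in the abstract can be sketched as a small freezing plan over the three components. This is a minimal illustration, not the authors' implementation: the class and method names (`Component`, `LSTSketch`, `set_stage`, `trainable_names`) are hypothetical, and keeping the speech frontend frozen in both stages is an assumption, since the abstract only says which parts are tuned.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """A named model part with a flag saying whether it receives updates."""
    name: str
    trainable: bool = False


@dataclass
class LSTSketch:
    """Hypothetical stand-in for the frontend -> adapter -> LLM pipeline."""
    frontend: Component = field(default_factory=lambda: Component("speech_frontend"))
    adapter: Component = field(default_factory=lambda: Component("adapter"))
    llm: Component = field(default_factory=lambda: Component("llm_backend"))

    def components(self):
        # Order follows the data flow: speech features, adapter, LLM.
        return [self.frontend, self.adapter, self.llm]

    def set_stage(self, stage: int) -> None:
        """Select which components are updated in the given training stage."""
        if stage not in (1, 2):
            raise ValueError("stage must be 1 (modality adjustment) or 2 (fine-tuning)")
        # Assumption: the speech frontend stays frozen throughout.
        self.frontend.trainable = False
        # Both stages tune the adapter.
        self.adapter.trainable = True
        # Only stage 2 (downstream task fine-tuning) also trains the LLM.
        self.llm.trainable = stage == 2

    def trainable_names(self):
        return [c.name for c in self.components() if c.trainable]


model = LSTSketch()
model.set_stage(1)
stage1 = model.trainable_names()
model.set_stage(2)
stage2 = model.trainable_names()
print(stage1, stage2)
```

In a real framework the `trainable` flag would map onto whatever mechanism freezes parameters; the point here is only that stage 1 touches the adapter alone, while stage 2 updates the adapter and the LLM together.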

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis

Methods: ALIGN · Adapter