Tuning Large language model for End-to-end Speech Translation
Hao Zhang, Nianwen Si, Yaqi Chen, Wenlin Zhang, Xukui Yang, Dan Qu,, Xiaolin Jiao

TL;DR
This paper presents LST, a large multimodal model optimized for end-to-end speech translation, achieving state-of-the-art BLEU scores on MuST-C benchmark through a two-stage training process.
Contribution
Introduces LST, a novel multimodal model with a specialized training strategy for improved speech translation performance.
Findings
LST-13B surpasses previous models on MuST-C benchmark.
Two-stage training effectively aligns speech and text representations.
Achieves state-of-the-art BLEU scores for multiple language pairs.
Abstract
With the emergence of large language models (LLMs), multimodal models based on LLMs have demonstrated significant potential. Models such as LLaSM, X-LLM, and SpeechGPT exhibit an impressive ability to comprehend and generate human instructions. However, their performance often falters when faced with complex tasks like end-to-end speech translation (E2E-ST), a cross-language and cross-modal translation task. In comparison to single-modal models, multimodal models lag behind in these scenarios. This paper introduces LST, a Large multimodal model designed to excel at the E2E-ST task. LST consists of a speech frontend, an adapter, and a LLM backend. The training of LST consists of two stages: (1) Modality adjustment, where the adapter is tuned to align speech representation with text embedding space, and (2) Downstream task fine-tuning, where both the adapter and LLM model are trained to…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNatural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
MethodsALIGN · Adapter
