LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models

Xi Chen; Songyang Zhang; Qibing Bai; Kai Chen; Satoshi Nakamura

arXiv:2407.15415 · cs.CL · July 23, 2024 · 1 citation

TL;DR

LLaST is an LLM-based speech translation framework that improves end-to-end speech-to-text translation through architecture design and optimization techniques, achieving superior results on the CoVoST-2 benchmark and scaling well with LLMs.

Contribution

The paper presents an LLM-based speech translation architecture together with ASR-augmented training, multilingual data augmentation, and dual-LoRA optimization, improving performance and scalability over existing end-to-end models.
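
The abstract names ASR-augmented training and multilingual data augmentation but does not spell out how the training data is built. As a rough illustration only, the Python sketch below shows one common way to assemble such data: each utterance yields an auxiliary transcription (ASR) example alongside one speech-translation example per available target language. The `Utterance` record, the `build_examples` helper, the prompt templates, and the file path are all hypothetical and are not taken from the LLaST code.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Utterance:
    """One speech sample with its source transcript and reference translations."""
    audio_path: str
    source_lang: str
    transcript: str
    translations: Dict[str, str] = field(default_factory=dict)  # target lang -> text


def build_examples(utt: Utterance, with_asr: bool = True) -> List[Dict[str, str]]:
    """Turn one utterance into instruction-style (audio, prompt, target) examples.

    The prompt templates are hypothetical; the paper's actual prompts are not
    given on this page.
    """
    examples: List[Dict[str, str]] = []
    if with_asr:
        # Assumed form of ASR augmentation: also ask the model to transcribe.
        examples.append({
            "audio": utt.audio_path,
            "prompt": f"Transcribe the audio in {utt.source_lang}.",
            "target": utt.transcript,
        })
    for tgt_lang in sorted(utt.translations):
        # One speech-translation example per available target language.
        examples.append({
            "audio": utt.audio_path,
            "prompt": f"Translate the audio from {utt.source_lang} into {tgt_lang}.",
            "target": utt.translations[tgt_lang],
        })
    return examples


if __name__ == "__main__":
    utt = Utterance(
        audio_path="clips/sample_0001.wav",  # hypothetical path
        source_lang="fr",
        transcript="bonjour tout le monde",
        translations={"en": "hello everyone"},
    )
    for ex in build_examples(utt):
        print(ex["prompt"], "->", ex["target"])
```

Keeping the ASR example next to the translation examples for the same audio lets one model see both tasks in a single training mix; the task ratio and ordering are design choices this sketch does not attempt to reproduce.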

Findings

01. Superior performance on the CoVoST-2 benchmark
02. Effective scaling with LLMs
03. A strong baseline for future speech translation research

Abstract

We introduce LLaST, a framework for building high-performance speech-to-text translation systems based on large language models (LLMs). We address the limitations of end-to-end speech translation (E2E ST) models by exploring model architecture design and optimization techniques tailored to LLMs. Our approach includes an LLM-based speech translation architecture, ASR-augmented training, multilingual data augmentation, and dual-LoRA optimization. It demonstrates superior performance on the CoVoST-2 benchmark and strong scaling capabilities powered by LLMs. We believe this method will serve as a strong baseline for speech translation and provide insights for future improvements to LLM-based speech translation frameworks. We release the data, code, and models at https://github.com/openaudiolab/LLaST.
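
Dual-LoRA optimization builds on low-rank adaptation (LoRA), in which a frozen weight matrix W is augmented with a trainable update (alpha / r) · B A of rank r. The dependency-free Python sketch below shows that forward pass and sets up two independent adapters. Reading "dual" as one adapter on the speech encoder and one on the LLM is an assumption made for illustration; the `LoRALinear` class, its deterministic initialization, and the toy weights are hypothetical and not the released implementation.

```python
from dataclasses import dataclass, field
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


@dataclass
class LoRALinear:
    """Frozen weight W plus a trainable low-rank update scaled by alpha / rank.

    forward(x) = W x + (alpha / rank) * B (A x)
    """
    weight: Matrix                  # frozen, shape (out_dim, in_dim)
    rank: int
    alpha: float
    a: Matrix = field(init=False)   # trainable, shape (rank, in_dim)
    b: Matrix = field(init=False)   # trainable, shape (out_dim, rank)

    def __post_init__(self) -> None:
        out_dim, in_dim = len(self.weight), len(self.weight[0])
        # A gets small nonzero values (deterministic here for reproducibility;
        # real implementations usually initialize it randomly). B starts at zero,
        # so the adapted layer initially behaves exactly like the frozen one.
        self.a = [[0.01 * (i + j + 1) for j in range(in_dim)] for i in range(self.rank)]
        self.b = [[0.0] * self.rank for _ in range(out_dim)]

    def forward(self, x: Vector) -> Vector:
        base = matvec(self.weight, x)                 # frozen path
        update = matvec(self.b, matvec(self.a, x))    # low-rank path
        scale = self.alpha / self.rank
        return [y + scale * u for y, u in zip(base, update)]

    def trainable_params(self) -> int:
        """Only A and B are trained: rank * (in_dim + out_dim) values."""
        return self.rank * (len(self.weight[0]) + len(self.weight))


# Hypothetical "dual" setup: one adapter on a speech-encoder projection and one on
# an LLM projection, each with its own rank/alpha (toy values for illustration).
encoder_adapter = LoRALinear(weight=[[1.0, 0.0], [0.0, 1.0]], rank=1, alpha=2.0)
llm_adapter = LoRALinear(weight=[[2.0, 0.0], [0.0, 2.0]], rank=1, alpha=2.0)
```

Because `b` starts at zero, each adapted layer initially reproduces its frozen counterpart; only `a` and `b` are updated during fine-tuning. For a 4096 × 4096 projection with rank 8, that is 8 × (4096 + 4096) = 65,536 trainable values instead of 16,777,216, about 0.4%.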

Code & Models

Repositories

openaudiolab/llast
PyTorch · Official

Videos

LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models · Underline

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis