Investigating Decoder-only Large Language Models for Speech-to-text Translation
Chao-Wei Huang, Hui Lu, Hongyu Gong, Hirofumi Inaguma, Ilia Kulikov, Ruslan Mavlyutov, Sravya Popuri

TL;DR
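This paper explores decoder-only large language models for speech-to-text translation (S2TT). The proposed model achieves state-of-the-art results on CoVoST 2 and FLEURS among models trained without proprietary data, and the paper analyzes the effects of parameter-efficient fine-tuning techniques and task formulation.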
Contribution
It proposes a decoder-only architecture in which the LLM directly consumes the encoded speech representation and generates the text translation, and it evaluates parameter-efficient fine-tuning techniques and task formulations for S2TT.
Findings
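Achieves state-of-the-art performance on CoVoST 2 and FLEURS among models trained without proprietary data.
Investigates the effects of different parameter-efficient fine-tuning techniques and task formulations.
Validates the design choices of the proposed model through analyses, offering insights into integrating LLMs into S2TT.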
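To make "parameter-efficient fine-tuning" concrete, here is a minimal sketch of a low-rank adapter (LoRA) applied to one linear layer. LoRA is used here only as a widely known example of the technique family; this page does not say which methods the paper evaluated, and the function name, shapes, and scaling convention below are illustrative assumptions.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(x: Vector, W: Matrix, A: Matrix, B: Matrix, alpha: float) -> Vector:
    """Compute y = W x + (alpha / r) * B (A x).

    W (out x in) is the frozen pretrained weight.
    A (r x in) and B (out x r) are the only trainable parameters.
    """
    r = len(A)  # adapter rank
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


def trainable_fraction(d_in: int, d_out: int, r: int) -> float:
    """Share of parameters that are trained, relative to the full weight."""
    return r * (d_in + d_out) / (d_in * d_out)
```

With `B` initialized to zeros the adapted layer reproduces the frozen layer exactly, so fine-tuning starts from the pretrained behavior. Under these assumptions, a 4096 x 4096 layer with rank 8 trains well under 1% of the layer's parameters.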
Abstract
Large language models (LLMs), known for their exceptional reasoning capabilities, generalizability, and fluency across diverse domains, present a promising avenue for enhancing speech-related tasks. In this paper, we focus on integrating decoder-only LLMs into the task of speech-to-text translation (S2TT). We propose a decoder-only architecture that enables the LLM to directly consume the encoded speech representation and generate the text translation. Additionally, we investigate the effects of different parameter-efficient fine-tuning techniques and task formulation. Our model achieves state-of-the-art performance on CoVoST 2 and FLEURS among models trained without proprietary data. We also conduct analyses to validate the design choices of our proposed model and bring insights to the integration of LLMs into S2TT.
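The idea of an LLM "directly consuming the encoded speech representation" can be sketched as building one input sequence from text-prompt embeddings and speech-encoder frames. This is a rough illustration, not the authors' implementation: the average-pooling length adapter, the linear projection, the ordering of prompt and speech, and all names below are assumptions that this page does not confirm.

```python
from typing import List

Vector = List[float]


def downsample(frames: List[Vector], stride: int) -> List[Vector]:
    """Average every `stride` consecutive frames to shorten the speech sequence."""
    pooled = []
    for start in range(0, len(frames), stride):
        chunk = frames[start:start + stride]
        dim = len(chunk[0])
        pooled.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
    return pooled


def project(frames: List[Vector], weight: List[Vector]) -> List[Vector]:
    """Map speech-encoder frames to the LLM embedding size (weight is out x in)."""
    return [[sum(w * x for w, x in zip(row, f)) for row in weight] for f in frames]


def build_decoder_input(
    prompt_emb: List[Vector],
    speech_frames: List[Vector],
    weight: List[Vector],
    stride: int = 2,
) -> List[Vector]:
    """Concatenate prompt embeddings and projected speech frames into one sequence.

    A decoder-only LLM would attend over this sequence and generate the
    translation tokens autoregressively after it.
    """
    speech_emb = project(downsample(speech_frames, stride), weight)
    return prompt_emb + speech_emb
```

For example, three 2-dimensional speech frames pooled with stride 2 and projected to 3 dimensions yield two speech positions, which follow the prompt positions in the decoder input.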
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Focus
