KIT's Offline Speech Translation and Instruction Following Submission for IWSLT 2025
Sai Koneru, Maike Züfle, Thai-Binh Nguyen, Seymanur Akti, Jan Niehues, Alexander Waibel

TL;DR
This paper presents KIT's submissions for IWSLT 2025, utilizing large language models to improve offline speech translation and instruction following through multi-stage pipelines and end-to-end models with contextual refinement.
Contribution
It introduces novel pipeline and end-to-end models that leverage LLMs for enhanced speech translation and instruction following, incorporating document-level context and refinement stages.
Findings
Improved translation quality with LLM-based fusion and refinement.
Effective instruction following with integrated speech encoder and LLM.
Enhanced performance through contextual document-level processing.
Abstract
The scope of the International Workshop on Spoken Language Translation (IWSLT) has recently broadened beyond traditional Speech Translation (ST) to encompass a wider array of tasks, including Speech Question Answering and Summarization. This shift is partly driven by the growing capabilities of modern systems, particularly with the success of Large Language Models (LLMs). In this paper, we present the Karlsruhe Institute of Technology's submissions for the Offline ST and Instruction Following (IF) tracks, where we leverage LLMs to enhance performance across all tasks. For the Offline ST track, we propose a pipeline that employs multiple automatic speech recognition systems, whose outputs are fused using an LLM with document-level context. This is followed by a two-step translation process, incorporating an additional refinement step to improve translation quality. For the IF track, we…
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems
