Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

Sara Papi; Javier Garcia Gilabert; Zachary Hopton; Vilém Zouhar; Carlos Escolano; Gerard I. Gállego; Jorge Iranzo-Sánchez; Ahrii Kim; Dominik Macháček; Patricia Schmidtova; Maike Züfle

arXiv:2512.16378·cs.CL·April 28, 2026

TL;DR

This paper benchmarks SpeechLLMs against direct and cascaded speech translation systems. Cascades are generally the most reliable, but recent SpeechLLMs can match or outperform them in certain conditions, and integrating an LLM proves important for translation quality.

Contribution

The paper introduces Hearing to Translate, the first comprehensive benchmark suite evaluating SpeechLLMs against strong direct and cascade baselines across diverse conditions and languages.

Findings

01. Cascaded systems remain the most reliable option overall.

02. Recent SpeechLLMs can match or outperform cascades in several settings.

03. Integrating an LLM is crucial for high-quality speech translation.

Abstract

As Large Language Models (LLMs) expand beyond text, integrating speech as a native modality has given rise to SpeechLLMs, which process spoken language directly and enable speech-to-text translation (ST) and other downstream tasks, bypassing traditional transcription-based pipelines. Whether this integration improves ST quality over established cascaded architectures, however, remains an open question. We present Hearing to Translate, the first comprehensive test suite rigorously benchmarking 6 state-of-the-art SpeechLLMs against 16 strong direct and cascade systems that couple leading speech foundation models (SFMs) with multilingual LLMs. Our analysis spans 16 benchmarks, 13 language pairs, and 9 challenging conditions, including disfluent, noisy, and long-form speech. Across this extensive evaluation, we find that cascaded systems remain the most reliable solution overall, but most…
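
To make the architectural contrast in the abstract concrete, here is a minimal Python sketch of the two interfaces being compared: a cascade that chains a transcription step into a text translation step, and a single model that maps audio straight to translated text. Every name below (`make_cascade`, `stub_transcribe`, `stub_translate`, `stub_speech_llm`) is hypothetical and the components are stubs. It is not the benchmark's code and does not reflect any real model API.

```python
from typing import Callable

# Hypothetical type aliases; audio is plain bytes purely for illustration.
Transcriber = Callable[[bytes], str]            # SFM: audio -> source transcript
TextTranslator = Callable[[str, str], str]      # LLM: (text, target language) -> translation
SpeechTranslator = Callable[[bytes, str], str]  # direct system or SpeechLLM


def make_cascade(transcribe: Transcriber, translate: TextTranslator) -> SpeechTranslator:
    """Compose a transcription step and a text translation step into one ST system."""
    def cascade(audio: bytes, target_lang: str) -> str:
        transcript = transcribe(audio)             # step 1: speech -> source text
        return translate(transcript, target_lang)  # step 2: source text -> target text
    return cascade


# Stub components standing in for real models (assumptions, not real APIs).
def stub_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")  # pretend the audio "contains" its transcript


def stub_translate(text: str, target_lang: str) -> str:
    return f"[{target_lang}] {text}"


def stub_speech_llm(audio: bytes, target_lang: str) -> str:
    # A SpeechLLM exposes the same audio-in, text-out interface in a single step.
    return f"[{target_lang}] {audio.decode('utf-8')}"


cascade_system = make_cascade(stub_transcribe, stub_translate)
```

Because both kinds of system share the same audio-in, translation-out signature in this sketch, an evaluation harness could score them with identical code. The practical difference is that a cascade passes an intermediate transcript between its components, whereas a SpeechLLM does not.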

Code & Models

Repositories

sarapapi/hearing2translate (GitHub)
