Speech-to-Text Translation with Phoneme-Augmented CoT: Enhancing Cross-Lingual Transfer in Low-Resource Scenarios

Gerard I. Gállego; Oriol Pareras; Martí Cortada Garcia; Lucas Takanori; Javier Hernando

arXiv:2505.24691 · cs.CL · September 30, 2025

TL;DR

This paper introduces a phoneme-augmented Chain-of-Thought (CoT) framework for speech-to-text translation. Using phoneme recognition as an intermediate step and a curriculum learning strategy for training, it improves translation quality in low-resource settings and enables zero-resource translation for languages with no labeled speech data.

Contribution

The paper integrates phoneme representations into a multilingual CoT framework, built on a multilingual LLM extended to process speech and phonemes, to enhance cross-lingual transfer in low-resource scenarios.

Findings

01. Improves translation quality in low-resource settings.

02. Enables zero-resource translation for unseen languages.

03. Slightly reduces performance on high-resource languages.

Abstract

We propose a Speech-to-Text Translation (S2TT) approach that integrates phoneme representations into a Chain-of-Thought (CoT) framework to improve translation in low-resource and zero-resource settings. By introducing phoneme recognition as an intermediate step, we enhance cross-lingual transfer, enabling translation even for languages with no labeled speech data. Our system builds on a multilingual LLM, which we extend to process speech and phonemes. Training follows a curriculum learning strategy that progressively introduces more complex tasks. Experiments on multilingual S2TT benchmarks show that phoneme-augmented CoT improves translation quality in low-resource conditions and enables zero-resource translation, while slightly impacting high-resource performance. Despite this trade-off, our findings demonstrate that phoneme-based CoT is a promising step toward making S2TT more…
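
To make the two ideas in the abstract concrete, here is a minimal Python sketch of how a CoT training target with a phoneme step might be assembled, and how a curriculum might move from a simpler task to the full chain. It is an illustration under stated assumptions, not the authors' implementation: the tag format, the `Example` fields, the `CURRICULUM` stages, and the fixed `steps_per_stage` schedule are all hypothetical, and the actual system may include further intermediate steps not shown here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Example:
    """One training example; both fields are hypothetical names."""
    phonemes: Optional[str]      # phoneme sequence for the source speech
    translation: Optional[str]   # target-language text


# Hypothetical curriculum: each stage lists the steps the model must emit.
# Stage 0 is phoneme recognition alone; stage 1 is the full chain.
CURRICULUM: List[List[str]] = [
    ["phonemes"],
    ["phonemes", "translation"],
]


def build_target(example: Example, steps: List[str]) -> str:
    """Concatenate the requested steps into one CoT-style target string."""
    parts = []
    for step in steps:
        value = getattr(example, step)
        if value is None:
            raise ValueError(f"example has no annotation for step '{step}'")
        # Assumed tag format: "<step> content", one step per line.
        parts.append(f"<{step}> {value}")
    return "\n".join(parts)


def stage_for_step(train_step: int, steps_per_stage: int) -> int:
    """Pick the curriculum stage for a training step, clamped to the last stage."""
    return min(train_step // steps_per_stage, len(CURRICULUM) - 1)


if __name__ == "__main__":
    ex = Example(phonemes="o l a", translation="hello")
    for train_step in (0, 150):
        stage = stage_for_step(train_step, steps_per_stage=100)
        print(f"step {train_step}, stage {stage}:")
        print(build_target(ex, CURRICULUM[stage]))
```

In this sketch the phoneme line always precedes the translation line, so a model trained on such targets would emit phonemes first and condition the translation on them. That is one way an intermediate phoneme step could be expressed in a text-generation setup.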
