Listen, Attend, Understand: a Regularization Technique for Stable E2E Speech Translation Training on High Variance labels

Yacouba Diarra; Michael Leventhal

arXiv:2601.01121 · cs.CL · January 6, 2026

TL;DR

This paper introduces LAU, a regularization method for end-to-end speech translation that improves semantic preservation and training stability, especially with limited or noisy data, by constraining the acoustic encoder's latent space.

Contribution

LAU is a novel semantic regularization technique that uses frozen text embeddings to guide the acoustic encoder without increasing inference cost.

Findings

01 On standard metrics, LAU models perform comparably to an E2E-ST system pretrained with 100% more data.

02 LAU better preserves semantic meaning in translation.

03 Total Parameter Drift quantifies structural encoder changes due to regularization.

Abstract

End-to-End Speech Translation often shows slower convergence and worse performance when target transcriptions exhibit high variance and semantic ambiguity. We propose Listen, Attend, Understand (LAU), a semantic regularization technique that constrains the acoustic encoder's latent space during training. By leveraging frozen text embeddings to provide a directional auxiliary loss, LAU injects linguistic groundedness into the acoustic representation without increasing inference cost. We evaluate our method on a Bambara-to-French dataset with 30 hours of Bambara speech translated by non-professionals. Experimental results demonstrate that LAU models achieve performance comparable, by standard metrics, to an E2E-ST system pretrained with 100% more data, while better preserving semantic meaning. Furthermore, we introduce Total Parameter Drift as a metric to quantify…
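
The abstract describes a directional auxiliary loss driven by frozen text embeddings but does not give its formula on this page. The sketch below shows one plausible reading, not the authors' implementation: encoder frames are mean-pooled and pulled toward the direction of a frozen text embedding via a cosine distance, and the result is added to the translation loss with a weight. The names `mean_pool`, `directional_aux_loss`, `total_loss`, and `aux_weight`, as well as the choice of mean pooling and cosine distance, are assumptions made for illustration.

```python
import math


def mean_pool(frames):
    """Average a list of equal-length frame vectors into a single vector."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / n for i in range(dim)]


def cosine_similarity(u, v, eps=1e-8):
    """Cosine of the angle between u and v; eps guards against zero norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)


def directional_aux_loss(encoder_frames, text_embedding):
    """Assumed form of a directional loss: 1 - cos(pooled audio, frozen text).

    Only the direction of the vectors matters, not their magnitude.
    The text embedding is treated as a constant (frozen) target.
    """
    pooled = mean_pool(encoder_frames)
    return 1.0 - cosine_similarity(pooled, text_embedding)


def total_loss(translation_loss, aux_loss, aux_weight=0.1):
    """Training objective under the assumption of a simple weighted sum."""
    return translation_loss + aux_weight * aux_loss


# Toy example: two 2-dimensional encoder frames and a frozen text embedding.
frames = [[1.0, 0.0], [1.0, 0.0]]
aligned_text = [2.0, 0.0]      # same direction, different magnitude
orthogonal_text = [0.0, 3.0]   # perpendicular direction

loss_aligned = directional_aux_loss(frames, aligned_text)
loss_orthogonal = directional_aux_loss(frames, orthogonal_text)
```

Because the auxiliary term is used only when computing the training objective, a setup like this adds nothing at inference time, which is consistent with the abstract's claim of no increase in inference cost.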

Code & Models

Datasets

RobotsMali/lau-eval
dataset · 10 downloads

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling