Listen, Attend, Understand: a Regularization Technique for Stable E2E Speech Translation Training on High Variance labels
Yacouba Diarra, Michael Leventhal

TL;DR
This paper introduces LAU, a regularization method for end-to-end speech translation that improves semantic preservation and training stability, especially with limited or noisy data, by constraining the acoustic encoder's latent space.
Contribution
LAU is a novel semantic regularization technique that uses frozen text embeddings to guide the acoustic encoder without increasing inference cost.
Findings
LAU models perform comparably, on standard metrics, to an E2E-ST system pretrained with 100% more data.
LAU better preserves semantic meaning in translation.
Total Parameter Drift quantifies structural encoder changes due to regularization.
Abstract
End-to-End Speech Translation often shows slower convergence and worse performance when target transcriptions exhibit high variance and semantic ambiguity. We propose Listen, Attend, Understand (LAU), a semantic regularization technique that constrains the acoustic encoder's latent space during training. By leveraging frozen text embeddings to provide a directional auxiliary loss, LAU injects linguistic groundedness into the acoustic representation without increasing inference cost. We evaluate our method on a Bambara-to-French dataset with 30 hours of Bambara speech translated by non-professionals. Experimental results demonstrate that LAU models perform comparably, on standard metrics, to an E2E-ST system pretrained with 100% more data, while better preserving semantic meaning. Furthermore, we introduce Total Parameter Drift as a metric to quantify…
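The abstract describes a directional auxiliary loss computed against frozen text embeddings, but does not give its exact form. As a rough illustration only, the sketch below assumes one plausible reading: the acoustic encoder's frame vectors are mean-pooled and pushed toward the direction of a frozen text embedding through a cosine-distance term, which is then added to the translation loss. The function names, the mean pooling, the cosine form, and the `aux_weight` value are all assumptions made for this example and are not taken from the paper.

```python
import math

def mean_pool(frames):
    """Average a list of frame vectors (T x D) into a single D-dim vector."""
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (small epsilon avoids division by zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def directional_aux_loss(encoder_frames, frozen_text_embedding):
    """Hypothetical auxiliary loss: 1 - cos(pooled acoustic vector, text embedding).

    Only the direction of the vectors matters, not their magnitude.
    The text embedding is treated as a constant target (frozen).
    """
    pooled = mean_pool(encoder_frames)
    return 1.0 - cosine_similarity(pooled, frozen_text_embedding)

def total_loss(translation_loss, encoder_frames, frozen_text_embedding, aux_weight=0.1):
    """Training objective in this sketch: translation loss plus weighted auxiliary term.

    aux_weight is an illustrative value, not one reported by the authors.
    """
    aux = directional_aux_loss(encoder_frames, frozen_text_embedding)
    return translation_loss + aux_weight * aux

# Toy usage: two encoder frames in a 2-dim latent space.
frames = [[1.0, 0.0], [3.0, 0.0]]
aligned = directional_aux_loss(frames, [2.0, 0.0])     # same direction
orthogonal = directional_aux_loss(frames, [0.0, 5.0])  # perpendicular direction
```

Because the auxiliary term is used only during training in this sketch, dropping it at inference leaves the encoder-decoder path unchanged, which is consistent with the abstract's claim that the method adds no inference cost.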
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
