Languages in Whisper-Style Speech Encoders Align Both Phonetically and Semantically

Ryan Soh-Eun Shim; Domenico De Cristofaro; Chengzhi Martin Hu; Alessandro Vietti; Barbara Plank

arXiv:2505.19606·cs.CL·April 7, 2026

TL;DR

This paper tests whether cross-lingual alignment in Whisper-style speech encoders reflects semantic similarity rather than only phonetic similarity, and shows that semantic alignment persists when phonetic cues are controlled for.

Contribution

The study provides evidence that Whisper-style speech encoders align languages semantically as well as phonetically, most clearly in models additionally trained on translation. It also tests early-exiting the encoder as a way to obtain representations that help speech recognition in low-resource languages.

Findings

01. Spoken translation retrieval remains above chance without phonetic cues.

02. Semantic alignment persists in the final encoder layers of models trained with translation.

03. Early-exiting the encoder improves speech recognition for low-resource languages.

Abstract

Cross-lingual alignment in pretrained language models enables knowledge transfer across languages. Similar alignment has been reported in Whisper-style speech encoders, based on spoken translation retrieval using representational similarity. However, prior work does not control for phonetic overlap between equivalent utterances, which may artificially support retrieval. We conduct pronunciation-controlled experiments to test whether cross-lingual alignment arises from semantic rather than phonetic similarity. Results show that spoken translation retrieval remains strongly above chance without phonetic cues in the final layers of encoders trained with a speech recognition objective, most clearly for models additionally trained on translation. We further test early-exiting the encoder to induce representations we hypothesize to be less tied to language-specific semantics. These…
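The retrieval setup the abstract refers to can be illustrated with a minimal sketch. This is not the authors' code. It assumes each utterance has already been reduced to a single fixed-size vector (for example by mean-pooling encoder states from one layer), and the names `cosine`, `retrieval_accuracy`, and `chance_level` are hypothetical. The toy vectors are made up for illustration, and the sketch does not model the pronunciation control described in the paper.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieval_accuracy(queries: List[Vector], candidates: List[Vector]) -> float:
    """Top-1 retrieval accuracy.

    Assumption: queries[i] and candidates[i] are utterance embeddings of
    the same sentence spoken in two different languages.
    """
    if not queries or len(queries) != len(candidates):
        raise ValueError("queries and candidates must be non-empty and aligned")
    hits = 0
    for i, query in enumerate(queries):
        scores = [cosine(query, cand) for cand in candidates]
        best = max(range(len(candidates)), key=scores.__getitem__)
        hits += int(best == i)
    return hits / len(queries)


def chance_level(n_candidates: int) -> float:
    """Expected top-1 accuracy of a uniformly random guess."""
    return 1.0 / n_candidates


# Toy embeddings (illustrative only): three sentences in two languages.
src = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
tgt = [[0.9, 0.2, 0.1], [0.1, 0.8, 0.3], [0.2, 0.1, 0.9]]

accuracy = retrieval_accuracy(src, tgt)
print(f"accuracy={accuracy:.2f}, chance={chance_level(len(tgt)):.2f}")
```

In this framing, "above chance" means the top-1 accuracy exceeds `1 / n_candidates`. Whether that margin reflects meaning or shared pronunciation is the question the pronunciation-controlled experiments address.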
