Is Textual Similarity Invariant under Machine Translation? Evidence Based on the Political Manifesto Corpus
Daria Boratyn, Damian Brzyski, Albert Leśniak, Wojciech Łukasik, Maciej Rapacz, Jan Rybicki, Wojciech Słomczyński, Dariusz Stolicki

TL;DR
This study examines whether cosine similarity between paragraph embeddings remains stable under machine translation across multiple languages, revealing that translation preserves semantic structure in some languages but distorts it in others.
Contribution
The paper introduces a novel framework for testing translation invariance of semantic similarity measures, applicable across different corpora and embedding models.
Findings
Ten languages show translation invariance in semantic similarity.
Four languages exhibit detectable semantic distortion after translation.
The framework can be extended to downstream NLP tasks.
Abstract
We investigate the extent to which cosine similarity between paragraph embeddings is invariant under machine translation, using the Manifesto Corpus of over 2,800 political party platforms in 28 languages translated to English via the EU eTranslation service. Rather than measuring translation-induced semantic shift directly, we measure the stability of pairwise similarity relationships across embedding models and use inter-model disagreement on original-language text as a calibrated invariance threshold. This yields a per-language non-inferiority test for four hypotheses about how translation interacts with embedding choice, with verdicts that distinguish languages where translation demonstrably preserves semantic structure from those where it demonstrably degrades it and from those where the available evidence does not resolve the question. The framework is corpus- and…
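To make the idea of comparing pairwise similarity relationships concrete, the following is a minimal sketch, not the authors' implementation. It assumes each paragraph is already embedded as a row of a NumPy array, uses Pearson correlation between pairwise cosine similarities as a stand-in agreement measure, and replaces the paper's non-inferiority test with a plain threshold comparison. The function names and the `margin` parameter are hypothetical and chosen only for illustration.

```python
import numpy as np


def pairwise_cosine(embeddings):
    """Return the n x n cosine similarity matrix for row-vector embeddings.

    Assumes no row is the zero vector.
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms
    return unit @ unit.T


def upper_triangle(matrix):
    """Flatten the strictly upper triangle so each unordered pair appears once."""
    rows, cols = np.triu_indices(matrix.shape[0], k=1)
    return matrix[rows, cols]


def similarity_agreement(emb_a, emb_b):
    """Pearson correlation between the pairwise similarities of two embeddings
    of the same paragraphs (for example, two models, or original vs. translated).
    """
    sims_a = upper_triangle(pairwise_cosine(emb_a))
    sims_b = upper_triangle(pairwise_cosine(emb_b))
    return float(np.corrcoef(sims_a, sims_b)[0, 1])


def passes_threshold(translated_agreement, baseline_agreement, margin=0.05):
    """Hypothetical decision rule (an assumption, not the paper's test):
    translation counts as acceptable if agreement involving translated text
    is no more than `margin` below inter-model agreement on original text.
    """
    return translated_agreement >= baseline_agreement - margin


# Synthetic stand-ins for real embeddings: 50 paragraphs, 16 dimensions.
rng = np.random.default_rng(0)
original_model_a = rng.normal(size=(50, 16))
original_model_b = original_model_a + 0.3 * rng.normal(size=(50, 16))
translated_model_a = original_model_a + 0.3 * rng.normal(size=(50, 16))

baseline = similarity_agreement(original_model_a, original_model_b)
translated = similarity_agreement(original_model_a, translated_model_a)
print(passes_threshold(translated, baseline))
```

In this sketch the baseline plays the role of the calibrated threshold: disagreement between two models on untranslated text sets the scale against which translation-induced disagreement is judged. A real analysis would replace the fixed margin with a proper statistical test and use actual model embeddings.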
