Lingua Custodi's participation at the WMT 2025 Terminology shared task
Jingshu Liu, Raheel Qader, Gaëtan Caillaut, Mariam Nakhlé

TL;DR
This paper explores multilingual sentence embeddings using BERT-based models, demonstrating significant reductions in training data needs and achieving state-of-the-art retrieval accuracy across 112 languages, with practical applications in translation.
Contribution
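It combines masked language modeling (MLM), translation language modeling (TLM), dual encoder translation ranking, and additive margin softmax into a single multilingual sentence embedding approach, reducing the parallel data required by 80% and outperforming LASER on Tatoeba bi-text retrieval.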
Findings
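Achieves 83.7% bi-text retrieval accuracy over 112 languages on Tatoeba, compared with 65.5% for LASER.
Reduces the parallel training data needed for good performance by 80% when starting from a pre-trained multilingual language model.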
Enables training competitive NMT models for en-zh and en-de.
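As a rough illustration of how a bi-text retrieval accuracy figure like the one above is typically computed, the sketch below scores each source sentence embedding against every candidate target embedding by cosine similarity and counts a hit when the top-ranked candidate is the true translation. The function names and the toy vectors are hypothetical, and the paper's exact evaluation protocol may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieval_accuracy(src_embs, tgt_embs):
    """Fraction of source sentences whose nearest target is the aligned one.

    Assumes src_embs[i] and tgt_embs[i] embed a translation pair.
    """
    hits = 0
    for i, s in enumerate(src_embs):
        scores = [cosine(s, t) for t in tgt_embs]
        best = max(range(len(scores)), key=scores.__getitem__)
        if best == i:
            hits += 1
    return hits / len(src_embs)


# Toy embeddings (illustrative only, not real model output).
src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
aligned_tgt = [list(v) for v in src]     # every pair matches
swapped_tgt = [src[1], src[0], src[2]]   # first two pairs are swapped

print(retrieval_accuracy(src, aligned_tgt))
print(retrieval_accuracy(src, swapped_tgt))
```

With the swapped targets only the third pair is still retrieved correctly, so the score drops from a perfect match to one hit out of three.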
Abstract
While BERT is an effective method for learning monolingual sentence embeddings for semantic similarity and embedding-based transfer learning, BERT-based cross-lingual sentence embeddings have yet to be explored. We systematically investigate methods for learning multilingual sentence embeddings by combining the best methods for learning monolingual and cross-lingual representations, including masked language modeling (MLM), translation language modeling (TLM), dual encoder translation ranking, and additive margin softmax. We show that introducing a pre-trained multilingual language model reduces the amount of parallel training data required to achieve good performance by 80%. Composing the best of these methods produces a model that achieves 83.7% bi-text retrieval accuracy over 112 languages on Tatoeba, well above the 65.5% achieved by LASER, while still performing…
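The abstract names dual encoder translation ranking with additive margin softmax but gives no formula. The following is a minimal sketch of one common formulation, assuming in-batch negatives, cosine similarity between L2-normalized embeddings, and a margin subtracted from the score of the true pair only. The function names are hypothetical and the margin value is an arbitrary placeholder, not a setting taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def additive_margin_ranking_loss(src_embs, tgt_embs, margin=0.3):
    """One-directional in-batch ranking loss with an additive margin.

    For each source i, the true target i must outscore every other
    target in the batch even after `margin` is subtracted from its score.
    """
    src = [l2_normalize(v) for v in src_embs]
    tgt = [l2_normalize(v) for v in tgt_embs]
    n = len(src)
    total = 0.0
    for i in range(n):
        logits = [
            dot(src[i], tgt[j]) - (margin if j == i else 0.0)
            for j in range(n)
        ]
        # Numerically stable log-sum-exp over the batch.
        top = max(logits)
        log_z = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_z - logits[i]  # negative log-softmax of the true pair
    return total / n


def bidirectional_loss(src_embs, tgt_embs, margin=0.3):
    """Sum of source-to-target and target-to-source ranking losses."""
    return (additive_margin_ranking_loss(src_embs, tgt_embs, margin)
            + additive_margin_ranking_loss(tgt_embs, src_embs, margin))


# Toy batch: three orthogonal "sentence embeddings" (illustrative only).
batch = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shuffled = [batch[1], batch[2], batch[0]]  # misaligned targets

print(bidirectional_loss(batch, batch))     # aligned pairs
print(bidirectional_loss(batch, shuffled))  # misaligned pairs, higher loss
```

The margin makes the task harder for the true pair, so a model trained this way has to separate translations from non-translations by more than a plain softmax would require.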
Taxonomy
Topics: Topic Modeling · Advanced Graph Neural Networks · Natural Language Processing Techniques
