TL;DR
MEXMA adds token-level objectives to cross-lingual sentence encoder training: the sentence representation in one language is used to predict masked tokens in another language, which improves sentence representation quality on bi-text mining and several downstream tasks.
Contribution
The paper presents MEXMA, a novel method that combines sentence-level and token-level objectives for better cross-lingual sentence encoding.
Findings
Outperforms existing cross-lingual encoders on bi-text mining
Significantly improves downstream task performance
Enhances information encoding in token representations
Abstract
Current pre-trained cross-lingual sentence encoders use sentence-level objectives only. This can lead to loss of information, especially for tokens, which then degrades the sentence representation. We propose MEXMA, a novel approach that integrates both sentence-level and token-level objectives. The sentence representation in one language is used to predict masked tokens in another language, with both the sentence representation and all tokens directly updating the encoder. We show that adding token-level objectives greatly improves the sentence representation quality across several tasks. Our approach outperforms current pre-trained cross-lingual sentence encoders on bi-text mining as well as several downstream tasks. We also analyse the information encoded in our tokens, and how the sentence representation is built from them.
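The combination of objectives described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the random embedding table standing in for the encoder, mean pooling, the additive prediction head, the squared-error alignment term, the symmetric use of both translation directions, and the loss weights are all assumptions made for illustration, and every name below is hypothetical.

```python
import math
import random

random.seed(0)
DIM, VOCAB, MASK_ID = 8, 20, 0

# Hypothetical stand-in for the encoder: one random vector per token id.
EMBED = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB)]
# Hypothetical output projection used to score vocabulary items.
OUT_PROJ = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB)]

def encode(token_ids):
    """Return (sentence_vector, token_vectors); mean pooling is an assumption."""
    token_vecs = [EMBED[t] for t in token_ids]
    sent_vec = [sum(v[d] for v in token_vecs) / len(token_vecs) for d in range(DIM)]
    return sent_vec, token_vecs

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def cross_lingual_token_loss(src_ids, tgt_ids, masked_positions):
    """Predict masked target-language tokens from the clean source sentence vector."""
    src_sent, _ = encode(src_ids)
    masked_ids = [MASK_ID if i in masked_positions else t for i, t in enumerate(tgt_ids)]
    _, masked_vecs = encode(masked_ids)
    total = 0.0
    for pos in masked_positions:
        # Toy prediction head (assumption): sum the source sentence vector
        # and the state of the masked target token.
        hidden = [s + t for s, t in zip(src_sent, masked_vecs[pos])]
        logits = [sum(w * h for w, h in zip(row, hidden)) for row in OUT_PROJ]
        total += cross_entropy(logits, tgt_ids[pos])
    return total / len(masked_positions)

def sentence_alignment_loss(ids_a, ids_b):
    """Sentence-level term: mean squared error between the two sentence vectors."""
    sent_a, _ = encode(ids_a)
    sent_b, _ = encode(ids_b)
    return sum((x - y) ** 2 for x, y in zip(sent_a, sent_b)) / DIM

def combined_loss(ids_a, ids_b, masked_a, masked_b, token_weight=1.0, sent_weight=1.0):
    """Sentence-level plus token-level objective; the weights are assumptions."""
    token_loss = (cross_lingual_token_loss(ids_a, ids_b, masked_b)
                  + cross_lingual_token_loss(ids_b, ids_a, masked_a))
    return sent_weight * sentence_alignment_loss(ids_a, ids_b) + token_weight * token_loss

english = [3, 7, 5, 9]      # toy token ids for a sentence
french = [4, 8, 6, 10, 2]   # toy token ids for its translation
loss = combined_loss(english, french, masked_a=[1], masked_b=[0, 3])
print(f"combined loss: {loss:.4f}")
```

In a real training loop both terms would be backpropagated through a shared encoder, so the token-level error also shapes the sentence vector; the sketch only shows how the two terms are computed and combined.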
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. This paper proposes a multi-grained training objective for fine-tuning a pre-trained language model to enhance cross-lingual sentence representation. The method is straightforward to reproduce.
2. Experimental results demonstrate that incorporating token-level objectives into the training of cross-lingual sentence encoders (CLSE) significantly improves the quality of sentence representations, surpassing the performance of current state-of-the-art pre-trained CLSE models in bitext mining and
1. The concept of learning both sentence-level and token-level alignment has been explored in several studies, such as those by Wei et al. (2021) and Fan et al. (2022). The authors should clarify the differences and advantages of their proposed method compared to these previous works.
2. There are several cross-lingual benchmarks, such as XTREME, that could be used for comparison. The authors are encouraged to evaluate their method against other works using these benchmarks and provide a detail
- This is a well-written paper that introduces an effective framework for training cross-lingual sentence encoders.
- The performance is very strong, and I foresee that MEXMA will be widely adopted in various tasks that need cross-lingual sentence representations.
- My major concern is the novelty. In fact, the lexical-level approach is not as novel as the authors claim. This ICLR 2020 paper (https://openreview.net/forum?id=r1xCMyBtPS) and this NAACL 2019 paper (https://aclanthology.org/N19-1162/), neither mentioned in the submission, have explored very similar post-training fine-tuning ideas to improve cross-lingual sentence/contextualized word representations. All the significant components are adapted from existing work. Considering the above, combined
This work provides a valuable contribution by proposing a novel, intuitive, and effective approach to train sentence representations from multilingual aligned data. The paper's contents are structured well, and the method is presented clearly and concisely while also motivating its differences from previous approaches. The experiments are very comprehensive and show clear improvements across a diverse set of benchmarks. Ablations are performed thoroughly and further support the authors' design c
As a general comment, Section 5.6 was interesting, but I find it not very well motivated. It is unclear what useful information is to be gained by knowing about the lowest entropy of attention probabilities in MEXMA as opposed to LABSE. On the other hand, the attention sink phenomenon shown in Figure 11 seemed more interesting for readers interested in understanding mechanisms in the MEXMA model. Still, it was placed in the Appendix and never mentioned in the main body of the paper.
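For readers unfamiliar with the measurement this review refers to, the entropy of an attention distribution can be computed as in the minimal sketch below. The function name and the example distributions are hypothetical and are not taken from the paper; lower entropy means attention is concentrated on fewer tokens.

```python
import math

def attention_entropy(probs):
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]   # attention spread evenly over 4 tokens
peaked = [0.97, 0.01, 0.01, 0.01]    # attention concentrated on one token

print(attention_entropy(uniform))
print(attention_entropy(peaked))
```

The uniform case reaches the maximum for four tokens, log 4, while the peaked case is much lower.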
