MEXMA: Token-level objectives improve sentence representations

João Maria Janeiro, Benjamin Piwowarski, Patrick Gallinari, Loïc Barrault

arXiv:2409.12737·cs.CL·September 20, 2024

TL;DR

MEXMA adds token-level objectives to cross-lingual sentence encoder training: the sentence representation in one language is used to predict masked tokens in another language. This improves sentence representation quality on bi-text mining and several downstream tasks.

Contribution

The paper presents MEXMA, a novel method that combines sentence-level and token-level objectives for better cross-lingual sentence encoding.

Findings

01

Outperforms existing cross-lingual encoders on bi-text mining

02

Significantly improves downstream task performance

03

Enhances information encoding in token representations
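The bi-text mining finding can be made concrete with a small example. The following is a minimal sketch, assuming mining is done by nearest-neighbour search under cosine similarity and scored by an error rate; the function names and the toy vectors are hypothetical and are not taken from the paper or its repository.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mine_bitext(src_vecs, tgt_vecs):
    """For each source vector, return the index of the most similar target vector."""
    return [
        max(range(len(tgt_vecs)), key=lambda j: cosine(s, tgt_vecs[j]))
        for s in src_vecs
    ]

def error_rate(predicted, gold):
    """Fraction of source sentences matched to the wrong target sentence."""
    wrong = sum(p != g for p, g in zip(predicted, gold))
    return wrong / len(gold)

# Toy sentence embeddings (invented for illustration only):
# source sentence i is assumed to be the translation of target sentence i.
src = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
tgt = [[0.9, 0.2, 0.1], [0.1, 0.8, 0.3], [0.2, 0.1, 0.9]]

matches = mine_bitext(src, tgt)
print(matches, error_rate(matches, [0, 1, 2]))
```

A better cross-lingual encoder places translations closer together in the embedding space, which lowers the error rate of this kind of nearest-neighbour matching.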

Abstract

Current pre-trained cross-lingual sentence encoder approaches use sentence-level objectives only. This can lead to loss of information, especially for tokens, which then degrades the sentence representation. We propose MEXMA, a novel approach that integrates both sentence-level and token-level objectives. The sentence representation in one language is used to predict masked tokens in another language, with both the sentence representation and all tokens directly updating the encoder. We show that adding token-level objectives greatly improves the sentence representation quality across several tasks. Our approach outperforms current pre-trained cross-lingual sentence encoders on bi-text mining as well as several downstream tasks. We also analyse the information encoded in our tokens, and how the sentence representation is built from them.
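The pairing of a sentence-level and a token-level objective described in the abstract can be illustrated with a toy loss computation. This is a minimal sketch under stated assumptions, not the authors' implementation: the squared-distance alignment term, the weighting scheme, and all function names are hypothetical. The abstract only states that masked tokens in one language are predicted with the help of the other language's sentence representation, and that both levels update the encoder.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def token_level_loss(logits_per_position, targets, masked_positions):
    """Average cross-entropy over the masked positions only."""
    losses = [
        cross_entropy(logits_per_position[i], targets[i])
        for i in masked_positions
    ]
    return sum(losses) / len(losses)

def sentence_level_loss(sent_a, sent_b):
    """Mean squared distance between two sentence vectors.

    Assumption: a simple alignment term chosen for illustration.
    """
    return sum((a - b) ** 2 for a, b in zip(sent_a, sent_b)) / len(sent_a)

def combined_loss(token_loss, sentence_loss, sentence_weight=1.0):
    """Weighted sum of both objectives; the weight is a hypothetical knob."""
    return token_loss + sentence_weight * sentence_loss

# Toy setup: 3 target-language positions, a vocabulary of 4 token ids,
# and position 1 masked. In the method described in the abstract, these
# logits would be conditioned on the source-language sentence vector;
# here they are fixed placeholder numbers.
logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
targets = [2, 1, 3]
masked = [1]

sent_src = [0.5, -0.5]  # placeholder sentence vector, language A
sent_tgt = [0.5, 0.5]   # placeholder sentence vector, language B

tok = token_level_loss(logits, targets, masked)
sen = sentence_level_loss(sent_src, sent_tgt)
total = combined_loss(tok, sen)
print(tok, sen, total)
```

With uniform logits the token term equals the log of the vocabulary size, so any information the sentence vector carries about the masked token can only reduce it. That is the intuition for why a token-level objective pushes more information into the sentence representation.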

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 3 · Confidence 5

Strengths

1. This paper proposes a multi-grained training objective for fine-tuning a pre-trained language model to enhance cross-lingual sentence representation. The method is straightforward to reproduce.
2. Experimental results demonstrate that incorporating token-level objectives into the training of cross-lingual sentence encoders (CLSE) significantly improves the quality of sentence representations, surpassing the performance of current state-of-the-art pre-trained CLSE models in bitext mining and

Weaknesses

1. The concept of learning both sentence-level and token-level alignment has been explored in several studies, such as those by Wei et al. (2021) and Fan et al. (2022). The authors should clarify the differences and advantages of their proposed method compared to these previous works.
2. There are several cross-lingual benchmarks, such as XTREME, that could be used for comparison. The authors are encouraged to evaluate their method against other works using these benchmarks and provide a detail

Reviewer 02 · Rating 5 · Confidence 4

Strengths

- This is a well-written paper that introduces an effective framework for training cross-lingual sentence encoders.
- The performance is very strong, and I foresee that MEXMA will be widely adopted in various tasks that need cross-lingual sentence representations.

Weaknesses

- My major concern is the novelty. In fact, the lexical-level approach is not as novel as the authors claim. This ICLR 2020 paper (https://openreview.net/forum?id=r1xCMyBtPS) and this NAACL 2019 paper (https://aclanthology.org/N19-1162/), neither mentioned in the submission, have explored very similar post-training fine-tuning ideas to improve cross-lingual sentence/contextualized word representations. All the significant components are adapted from existing work. Considering the above, combined

Reviewer 03 · Rating 8 · Confidence 3

Strengths

This work provides a valuable contribution by proposing a novel, intuitive, and effective approach to train sentence representations from multilingual aligned data. The paper's contents are structured well, and the method is presented clearly and concisely while also motivating its differences from previous approaches. The experiments are very comprehensive and show clear improvements across a diverse set of benchmarks. Ablations are performed thoroughly and further support the authors' design c

Weaknesses

As a general comment, Section 5.6 was interesting, but I find it not very well motivated. It is unclear what useful information is to be gained by knowing about the lowest entropy of attention probabilities in MEXMA as opposed to LABSE. On the other hand, the attention sink phenomenon shown in Figure 11 seemed more interesting for readers interested in understanding mechanisms in the MEXMA model. Still, it was placed in the Appendix and never mentioned in the main body of the paper.
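For readers unfamiliar with the quantity Reviewer 03 refers to, the entropy of an attention distribution can be computed as follows. This is a minimal sketch, assuming Shannon entropy in nats over a single row of attention probabilities; the example distributions are invented and do not come from MEXMA or LaBSE.

```python
import math

def attention_entropy(probs):
    """Shannon entropy (in nats) of one attention distribution.

    Entries with zero probability contribute nothing to the sum.
    """
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Invented attention rows over four tokens, for illustration only.
uniform = [0.25, 0.25, 0.25, 0.25]  # attention spread evenly
peaked = [0.97, 0.01, 0.01, 0.01]   # attention concentrated on one token

print(attention_entropy(uniform), attention_entropy(peaked))
```

Low entropy means the attention mass is concentrated on a few tokens, while high entropy means it is spread nearly uniformly across the sequence.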

Code & Models

Repositories

facebookresearch/mexma
PyTorch · Official

Models

🤗
facebook/MEXMA
model· 3.0k dl· ♡ 31

Videos

MEXMA: Token-level objectives improve sentence representations· underline