Multilingual Sentence Transformer as A Multilingual Word Aligner

Weikang Wang; Guanhua Chen; Hanqing Wang; Yue Han; Yun Chen

arXiv:2301.12140·cs.CL·January 31, 2023

TL;DR

This paper investigates whether LaBSE, a multilingual sentence Transformer, can serve as a word aligner. It shows that vanilla LaBSE already outperforms the mPLMs currently used for alignment, and that fine-tuning it on parallel corpora improves it further.

Contribution

The study shows that LaBSE, originally designed for sentence embeddings, can be effectively adapted for word alignment, outperforming previous models and supporting zero-shot language pairs.

Findings

01. Vanilla LaBSE outperforms the other mPLMs currently used for word alignment.

02. Fine-tuning LaBSE on parallel corpora further improves alignment accuracy, evaluated on seven language pairs.

03. The model achieves state-of-the-art results, including on zero-shot language pairs.

Abstract

Multilingual pretrained language models (mPLMs) have shown their effectiveness in multilingual word alignment induction. However, these methods usually start from mBERT or XLM-R. In this paper, we investigate whether multilingual sentence Transformer LaBSE is a strong multilingual word aligner. This idea is non-trivial as LaBSE is trained to learn language-agnostic sentence-level embeddings, while the alignment extraction task requires the more fine-grained word-level embeddings to be language-agnostic. We demonstrate that the vanilla LaBSE outperforms other mPLMs currently used in the alignment task, and then propose to finetune LaBSE on parallel corpus for further improvement. Experiment results on seven language pairs show that our best aligner outperforms previous state-of-the-art models of all varieties. In addition, our aligner supports different language pairs in a single model,…
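The abstract does not spell out how alignments are extracted from word-level embeddings. As a rough illustration only, the sketch below shows one approach commonly used by embedding-based aligners: compute a token similarity matrix, apply a softmax in each direction, and keep the pairs that exceed a threshold in both. The function names, the threshold value, and the toy embeddings are all assumptions made for this example; they stand in for real LaBSE token representations and are not taken from the paper or its repository.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def extract_alignments(src_emb, tgt_emb, threshold=0.1):
    """Return sorted (source_index, target_index) pairs.

    src_emb: array of shape (num_source_tokens, dim)
    tgt_emb: array of shape (num_target_tokens, dim)
    threshold: hypothetical cutoff; a pair is kept only if its
        probability exceeds it in BOTH alignment directions.
    """
    sim = src_emb @ tgt_emb.T           # dot-product similarity matrix
    p_src_to_tgt = softmax(sim, axis=1)  # each source token over targets
    p_tgt_to_src = softmax(sim, axis=0)  # each target token over sources
    keep = (p_src_to_tgt > threshold) & (p_tgt_to_src > threshold)
    rows, cols = np.nonzero(keep)
    return sorted((int(i), int(j)) for i, j in zip(rows, cols))


# Toy embeddings (an assumption, not model output): the target
# sentence is the source sentence with its first two tokens swapped.
toy_src = 5.0 * np.eye(3)
toy_tgt = toy_src[[1, 0, 2]]
toy_alignments = extract_alignments(toy_src, toy_tgt)
print(toy_alignments)
```

In practice the two embedding matrices would come from a chosen layer of the encoder, and subword-level links would still need to be mapped back to words; both steps are omitted here.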

Code & Models

Repositories

sufenlp/accalign (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification

Methods: Attention Is All You Need · Linear Layer · Softmax · Absolute Position Encodings · XLM-R · Byte Pair Encoding · Adam · Layer Normalization · Label Smoothing · Multi-Head Attention