TransAlign: Machine Translation Encoders are Strong Word Aligners, Too

Benedikt Ebing; Christian Goldschmied; Goran Glavaš

arXiv:2510.27337 · cs.CL · November 3, 2025

TL;DR

TransAlign leverages multilingual machine translation encoders to produce highly accurate word alignments, significantly improving label projection in cross-lingual transfer tasks compared to existing methods.

Contribution

This paper introduces TransAlign, a novel word aligner built on the encoder of a multilingual MT model, and shows that it outperforms both popular word aligners and label projection methods that do not rely on word alignment (non-WA).
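
To make the idea of an encoder-based word aligner concrete, here is a minimal sketch of one common similarity-based heuristic: compute cosine similarities between source and target token vectors and keep the pairs that are each other's best match. This is an illustration only and is not claimed to be TransAlign's procedure. The function name `mutual_argmax_align` and the toy two-dimensional vectors are assumptions; in practice the vectors would come from an encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mutual_argmax_align(src_vecs, tgt_vecs):
    """Return (src_idx, tgt_idx) pairs that are mutual nearest neighbours.

    Hypothetical helper for illustration; not the paper's algorithm.
    """
    sim = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]
    if not sim or not sim[0]:
        return []
    # Best target index for every source token.
    best_tgt = [max(range(len(row)), key=row.__getitem__) for row in sim]
    # Best source index for every target token.
    best_src = [
        max(range(len(sim)), key=lambda i: sim[i][j])
        for j in range(len(tgt_vecs))
    ]
    # Keep only the pairs on which both directions agree.
    return [(i, j) for i, j in enumerate(best_tgt) if best_src[j] == i]


# Toy token vectors (assumed values standing in for encoder outputs).
src = [[1.0, 0.0], [0.0, 1.0]]
tgt = [[0.1, 0.9], [0.9, 0.1]]
alignment = mutual_argmax_align(src, tgt)
print(alignment)
```

Requiring agreement in both directions trades recall for precision: a token with no clear counterpart simply stays unaligned.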

Findings

01. TransAlign outperforms popular word aligners in accuracy.

02. TransAlign improves label projection quality in cross-lingual token classification.

03. MT-based alignments with TransAlign surpass cross-attention methods in encoder-decoder models.

Abstract

In the absence of sizable training data for most world languages and NLP tasks, translation-based strategies such as translate-test -- evaluating on noisy source language data translated from the target language -- and translate-train -- training on noisy target language data translated from the source language -- have been established as competitive approaches for cross-lingual transfer (XLT). For token classification tasks, these strategies require label projection: mapping the labels from each token in the original sentence to its counterpart(s) in the translation. To this end, it is common to leverage multilingual word aligners (WAs) derived from encoder language models such as mBERT or LaBSE. Despite obvious associations between machine translation (MT) and WA, research on extracting alignments with MT models is largely limited to exploiting cross-attention in encoder-decoder…
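
Label projection as described in the abstract can be sketched as follows: given token-level labels on the original sentence and a set of word alignments, each labeled token passes its label to its aligned counterpart(s) in the translation. This is a minimal sketch under stated assumptions, not the authors' implementation. The function `project_labels`, the plain (non-BIO) label scheme, the example sentence pair, and the hand-written alignment are all hypothetical.

```python
def project_labels(src_labels, alignment, tgt_len, default="O"):
    """Copy each source token's label to the target tokens aligned with it.

    src_labels: one label per source token.
    alignment:  iterable of (src_idx, tgt_idx) pairs from a word aligner.
    tgt_len:    number of tokens in the translation.
    Unaligned target tokens keep the `default` label.
    """
    tgt_labels = [default] * tgt_len
    for src_idx, tgt_idx in alignment:
        label = src_labels[src_idx]
        # Only non-default labels are projected; the first label written
        # to a target token wins if several source tokens map to it.
        if label != default and tgt_labels[tgt_idx] == default:
            tgt_labels[tgt_idx] = label
    return tgt_labels


# Hypothetical example with word reordering:
# "Merkel visited Paris yesterday" -> "Merkel hat gestern Paris besucht"
src_labels = ["PER", "O", "LOC", "O"]
alignment = [(0, 0), (1, 1), (1, 4), (2, 3), (3, 2)]  # assumed aligner output
projected = project_labels(src_labels, alignment, tgt_len=5)
print(projected)
```

Because the projected labels are only as good as the alignment pairs, alignment errors translate directly into label noise, which is why aligner quality matters for translate-train and translate-test.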

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification