The Devil Is in the Word Alignment Details: On Translation-Based Cross-Lingual Transfer for Token Classification Tasks

Benedikt Ebing; Goran Glavaš

arXiv:2505.10507 · cs.CL · August 11, 2025

TL;DR

This paper systematically investigates word aligners for label projection in translation-based cross-lingual transfer for token classification, optimizing design choices and introducing an ensemble method that improves robustness and performance.

Contribution

It provides a detailed analysis of low-level design decisions for word aligners and introduces a novel ensemble projection strategy that surpasses marker-based methods.

Findings

01

Optimized word aligner design choices significantly improve XLT performance.

02

Ensemble of translate-train and translate-test predictions outperforms marker-based projection.

03

Proposed method reduces sensitivity to low-level alignment design decisions.

Abstract

Translation-based strategies for cross-lingual transfer (XLT), such as translate-train, which trains on noisy target-language data translated from the source language, and translate-test, which evaluates on noisy source-language data translated from the target language, are competitive XLT baselines. In XLT for token classification tasks, however, these strategies include label projection, the challenging step of mapping the labels from each token in the original sentence to its counterpart(s) in the translation. Although word aligners (WAs) are commonly used for label projection, the low-level design decisions for applying them to translation-based XLT have not been systematically investigated. Moreover, recent marker-based methods, which project labeled spans by inserting tags around them before (or after) translation, claim to outperform WAs in label projection for XLT. In this work, we…
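To make the label projection step concrete, here is a minimal sketch of projecting BIO labels through word alignments. It is an illustration under stated assumptions, not the authors' implementation: the function names `extract_spans` and `project_labels` are hypothetical, the alignment is assumed to be already available as (source index, target index) pairs from some word aligner, and the choices to fill the gap between the leftmost and rightmost aligned target token and to drop spans that would overlap an already projected span are just one possible set of the low-level design decisions the abstract refers to.

```python
from typing import List, Tuple


def extract_spans(labels: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end_exclusive, entity_type) for each BIO span."""
    spans = []
    start, etype = None, None
    for i, lab in enumerate(labels + ["O"]):  # sentinel closes a trailing span
        inside = lab.startswith("I-") and etype == lab[2:]
        if start is not None and not inside:
            spans.append((start, i, etype))
            start, etype = None, None
        # A span opens on "B-", or on a stray "I-" with no open span.
        if lab.startswith("B-") or (lab.startswith("I-") and start is None):
            start, etype = i, lab[2:]
    return spans


def project_labels(
    src_labels: List[str],
    alignment: List[Tuple[int, int]],
    tgt_len: int,
) -> List[str]:
    """Project source BIO labels onto a target sentence of tgt_len tokens.

    Assumed design choices (illustrative only):
      * a span covers everything from its leftmost to its rightmost
        aligned target token, so gaps inside the span are filled;
      * a span with no aligned target token is dropped;
      * a span that would overlap an already projected span is dropped.
    """
    tgt_labels = ["O"] * tgt_len
    for start, end, etype in extract_spans(src_labels):
        tgt_idx = sorted(t for s, t in alignment if start <= s < end)
        if not tgt_idx:
            continue
        lo, hi = tgt_idx[0], tgt_idx[-1]
        if any(lab != "O" for lab in tgt_labels[lo:hi + 1]):
            continue
        tgt_labels[lo] = "B-" + etype
        for j in range(lo + 1, hi + 1):
            tgt_labels[j] = "I-" + etype
    return tgt_labels


# Source: "Angela Merkel visited Paris"
src_labels = ["B-PER", "I-PER", "O", "B-LOC"]
# Target: "Paris wurde von Angela Merkel besucht" (6 tokens)
alignment = [(0, 3), (1, 4), (2, 5), (3, 0)]
projected = project_labels(src_labels, alignment, tgt_len=6)
print(projected)
```

In translate-train, a projection of this kind would carry source labels onto the translated training data; in translate-test, the predictions made on the translation would be projected back onto the original target-language tokens in the same way.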

Code & Models

Repositories

bebing93/devil-in-details
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification