Align and Shine: Building High-Quality Sentence-Aligned Corpora for Multilingual Text Simplification

Kenji Hilasaca, Nouran Khallaf, and Serge Sharoff

arXiv:2605.09476·cs.CL·May 12, 2026

TL;DR

This paper presents a method for creating high-quality, multilingual sentence-aligned corpora for text simplification, addressing data scarcity in non-English languages.

Contribution

It introduces a novel approach for sentence-level alignment from comparable corpora and releases a multilingual dataset for text simplification.

Findings

01 Constructed a multilingual corpus for text simplification in five languages.

02 Developed mechanisms for sentence-level alignment from document-level data.

03 The dataset is publicly available for research use.

Abstract

Text simplification plays a crucial role in improving the accessibility and comprehensibility of written information for diverse audiences, including language learners and readers with limited literacy. Despite its importance, large-scale, high-quality datasets for training and evaluating text simplification models remain scarce for languages other than English. This paper reports an experimental study on collecting and processing crowd-sourced simplification data from comparable corpora to construct a corpus suitable for both training and testing text simplification systems in five languages (Catalan, English, French, Italian and Spanish). We report mechanisms for sentence-level alignment from document-level data. The resulting dataset of aligned sentence pairs is publicly available.
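The abstract does not say which alignment mechanism the authors use. Purely as an illustration of what sentence-level alignment from document-level data can look like, the sketch below pairs each sentence of a simplified document with its most similar sentence in the original document, using Jaccard token overlap. The similarity measure, the threshold value, and the names `tokens`, `jaccard`, and `align_sentences` are assumptions made for this example and are not taken from the paper.

```python
import re


def tokens(sentence):
    """Return the set of lowercase word tokens (\\w is Unicode-aware in Python 3)."""
    return set(re.findall(r"\w+", sentence.lower()))


def jaccard(a, b):
    """Jaccard overlap between the token sets of two sentences."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def align_sentences(complex_sents, simple_sents, threshold=0.25):
    """Pair each simple sentence with its best-matching complex sentence.

    Hypothetical helper: a pair is kept only if its similarity reaches
    `threshold`, so weakly related sentences are left unaligned.
    """
    pairs = []
    for simple in simple_sents:
        best = max(complex_sents, key=lambda c: jaccard(c, simple), default=None)
        if best is None:
            continue
        score = jaccard(best, simple)
        if score >= threshold:
            pairs.append((best, simple, score))
    return pairs


if __name__ == "__main__":
    complex_doc = [
        "The feline rested upon the woven mat near the door.",
        "Precipitation is expected throughout the afternoon.",
    ]
    simple_doc = [
        "The cat rested on the mat.",
        "It will rain this afternoon.",
    ]
    for original, simplified, score in align_sentences(complex_doc, simple_doc):
        print(f"{score:.2f}  {original!r} -> {simplified!r}")
```

In this toy input only the first pair clears the threshold. The second simple sentence shares a single word with its source, which shows why pure lexical overlap tends to miss heavily paraphrased simplifications; embedding-based similarity is a common alternative, and whether the paper uses it is not stated here.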
