AlignAR: Generative Sentence Alignment for Arabic-English Parallel Corpora of Legal and Literary Texts

Baorong Huang; Ali Asiri

arXiv:2512.21842 · cs.CL · January 5, 2026

TL;DR

AlignAR introduces a novel generative sentence alignment method and provides a new Arabic-English dataset for legal and literary texts, demonstrating improved robustness of LLM-based approaches over traditional methods.

Contribution

The paper presents AlignAR, a new generative alignment method, and releases a challenging Arabic-English dataset, highlighting limitations of existing methods and showcasing the effectiveness of LLM-based approaches.

Findings

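01 LLM-based approaches achieved an overall F1-score of 85.5%, a nearly 9% improvement over previous methods.

02 Traditional methods struggle with complex, non-one-to-one alignments.

03 The new "Hard" subset, which contains fewer one-to-one mappings, exposes limitations of existing alignment techniques.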

Abstract

High-quality parallel corpora are essential for Machine Translation (MT) research and translation teaching. However, Arabic-English resources remain scarce, and existing datasets consist mainly of simple one-to-one mappings. In this paper, we present AlignAR, a generative sentence alignment method, and a new Arabic-English dataset comprising simple legal and complex literary parallel texts. Our evaluation demonstrates that "Easy" datasets lack the discriminatory power to fully assess alignment methods. By reducing the share of one-to-one mappings in our "Hard" subset, we expose the limitations of traditional alignment methods. In contrast, LLM-based approaches are more robust, achieving an overall F1-score of 85.5%, a nearly 9% improvement over previous methods. Our datasets and code are open-sourced at https://github.com/XXX.
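The abstract reports results as an F1-score but does not say how alignments are scored. The sketch below shows one common convention, strict matching of alignment beads, under stated assumptions: a predicted bead counts as correct only if its source and target sentence groups both match a gold bead exactly. The names `bead` and `alignment_f1` and the toy data are hypothetical and are not taken from the AlignAR code.

```python
from typing import FrozenSet, Iterable, Set, Tuple

# A "bead" links a group of source sentence indices to a group of target
# sentence indices, e.g. ({1, 2}, {1}) is a 2-to-1 alignment.
Bead = Tuple[FrozenSet[int], FrozenSet[int]]


def bead(src: Iterable[int], tgt: Iterable[int]) -> Bead:
    """Build a hashable bead from source and target sentence indices."""
    return (frozenset(src), frozenset(tgt))


def alignment_f1(gold: Set[Bead], predicted: Set[Bead]) -> Tuple[float, float, float]:
    """Strict-match precision, recall, and F1 over alignment beads.

    Assumption: a predicted bead is correct only if it is identical to a
    gold bead. Partial overlaps earn no credit under this convention.
    """
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: the gold alignment contains a 2-to-1 and a 1-to-2 bead.
gold = {bead([0], [0]), bead([1, 2], [1]), bead([3], [2, 3])}

# A system that splits the 2-to-1 bead into a 1-to-1 and a 1-to-0 bead.
predicted = {bead([0], [0]), bead([1], [1]), bead([2], []), bead([3], [2, 3])}

precision, recall, f1 = alignment_f1(gold, predicted)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

In this toy case a single mishandled 2-to-1 bead costs the system both a missed gold bead and two spurious predictions. This illustrates why a subset with fewer one-to-one mappings separates methods more sharply than an "Easy" one.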

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification