AlignAR: Generative Sentence Alignment for Arabic-English Parallel Corpora of Legal and Literary Texts
Baorong Huang, Ali Asiri

TL;DR
AlignAR is a generative sentence alignment method released together with a new Arabic-English dataset of legal and literary texts. On this dataset, LLM-based approaches prove more robust than traditional alignment methods.
Contribution
The paper presents AlignAR, a new generative alignment method, and releases a challenging Arabic-English dataset, highlighting limitations of existing methods and showcasing the effectiveness of LLM-based approaches.
Findings
LLM-based approaches achieved an overall F1-score of 85.5%, a nearly 9% improvement over previous methods.
Traditional methods struggle with complex, non-one-to-one alignments.
The new dataset exposes limitations of existing alignment techniques.
Abstract
High-quality parallel corpora are essential for Machine Translation (MT) research and translation teaching. However, Arabic-English resources remain scarce, and existing datasets mainly consist of simple one-to-one mappings. In this paper, we present AlignAR, a generative sentence alignment method, and a new Arabic-English dataset comprising simple legal and complex literary parallel texts. Our evaluation demonstrates that "Easy" datasets lack the discriminatory power to fully assess alignment methods. By reducing one-to-one mappings in our "Hard" subset, we exposed the limitations of traditional alignment methods. In contrast, LLM-based approaches demonstrated better robustness, achieving an overall F1-score of 85.5%, a nearly 9% improvement over previous methods. Our datasets and code are open-sourced at https://github.com/XXX.
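To make the reported metric concrete, the following is a minimal sketch of how an alignment F1-score can be computed when alignments are not restricted to one-to-one mappings. It assumes strict scoring, where a predicted alignment counts as correct only if it matches a gold alignment exactly; the paper's actual evaluation protocol is not described in this summary and may differ. The names `bead`, `alignment_f1`, and `one_to_one_share`, and the toy data, are hypothetical and chosen only for illustration.

```python
from typing import FrozenSet, Iterable, Set, Tuple

# An alignment "bead" links a group of source sentence indices to a group of
# target sentence indices, e.g. ({1, 2}, {1}) is a 2-to-1 alignment.
Bead = Tuple[FrozenSet[int], FrozenSet[int]]


def bead(src: Iterable[int], tgt: Iterable[int]) -> Bead:
    """Build a hashable bead from source and target sentence indices."""
    return (frozenset(src), frozenset(tgt))


def alignment_f1(gold: Set[Bead], pred: Set[Bead]) -> Tuple[float, float, float]:
    """Strict precision, recall, and F1 (assumption: exact bead match only)."""
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


def one_to_one_share(beads: Set[Bead]) -> float:
    """Fraction of beads that are simple 1-to-1 mappings."""
    if not beads:
        return 0.0
    simple = sum(1 for src, tgt in beads if len(src) == 1 and len(tgt) == 1)
    return simple / len(beads)


# Toy gold alignment containing a 2-to-1 and a 1-to-2 bead.
gold = {bead([0], [0]), bead([1, 2], [1]), bead([3], [2, 3]), bead([4], [4])}

# Toy prediction that splits the 2-to-1 bead into a 1-to-1 and a 1-to-0 bead.
pred = {bead([0], [0]), bead([1], [1]), bead([2], []), bead([3], [2, 3]), bead([4], [4])}

precision, recall, f1 = alignment_f1(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
print(f"1-to-1 share of gold: {one_to_one_share(gold):.2f}")
```

Under this strict scoring, an aligner that defaults to one-to-one output loses credit on every many-to-one or one-to-many bead. This is consistent with the abstract's observation that a subset with fewer one-to-one mappings separates alignment methods more clearly.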
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
