Pralekha: Cross-Lingual Document Alignment for Indic Languages
Sanjay Suryanarayanan, Haiyue Song, Mohammed Safi Ur Rahman Khan, Anoop Kunchukuttan, Raj Dabre

TL;DR
Pralekha is a large-scale document alignment benchmark for Indic languages, introduced together with DAC, a chunk-based alignment metric that improves the efficiency and accuracy of parallel document mining for machine translation.
Contribution
The paper presents Pralekha, a new benchmark with over 3 million aligned document pairs across 11 Indic languages and English, and proposes DAC, a fine-grained, efficient alignment metric that outperforms pooling-based methods.
Findings
DAC achieves 2-3x faster alignment
DAC outperforms pooling-based baselines in accuracy
MT models trained on DAC-aligned data perform better
Abstract
Mining parallel document pairs for document-level machine translation (MT) remains challenging due to the limitations of existing Cross-Lingual Document Alignment (CLDA) techniques. Existing methods often rely on metadata such as URLs, which are scarce, or on pooled document representations that fail to capture fine-grained alignment cues. Moreover, the limited context window of sentence embedding models hinders their ability to represent document-level context, while sentence-based alignment introduces a combinatorially large search space, leading to high computational cost. To address these challenges for Indic languages, we introduce Pralekha, a benchmark containing over 3 million aligned document pairs across 11 Indic languages and English, which includes 1.5 million English-Indic pairs. Furthermore, we propose Document Alignment Coefficient (DAC), a novel metric for fine-grained…
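The abstract above is cut off before DAC is defined, so the following is only an illustrative sketch of the contrast it draws between pooled document representations and chunk-level matching. Everything in it is an assumption made for illustration: the function names, the greedy one-to-one matching, the 0.8 similarity threshold, and the normalization by the average chunk count are hypothetical choices, not the paper's formula. Chunk embeddings are passed in as plain lists of floats; in practice they would come from a sentence embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_pool(chunks):
    """Average a list of chunk embeddings into one document vector."""
    n = len(chunks)
    return [sum(column) / n for column in zip(*chunks)]


def pooled_similarity(src_chunks, tgt_chunks):
    """Pooling-style baseline: compare one averaged vector per document."""
    return cosine(mean_pool(src_chunks), mean_pool(tgt_chunks))


def chunk_alignment_score(src_chunks, tgt_chunks, threshold=0.8):
    """Hypothetical chunk-matching score (an assumption, not the paper's DAC).

    Chunks are matched greedily one-to-one in order of decreasing cosine
    similarity, and the number of matches at or above `threshold` is divided
    by the average number of chunks in the two documents.
    """
    if not src_chunks or not tgt_chunks:
        return 0.0
    candidates = sorted(
        (
            (cosine(s, t), i, j)
            for i, s in enumerate(src_chunks)
            for j, t in enumerate(tgt_chunks)
        ),
        reverse=True,
    )
    used_src, used_tgt = set(), set()
    aligned = 0
    for sim, i, j in candidates:
        if sim < threshold:
            break  # remaining candidates are all below the threshold
        if i in used_src or j in used_tgt:
            continue  # each chunk may be matched at most once
        used_src.add(i)
        used_tgt.add(j)
        aligned += 1
    return 2 * aligned / (len(src_chunks) + len(tgt_chunks))
```

With toy two-chunk documents, a target whose chunks each closely match one source chunk gets the maximum score, while a target that shares only one chunk with the source is penalized in proportion to the unmatched chunks. The pooled baseline collapses each document to a single vector, so it cannot report which chunks matched.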
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Document Alignment Coefficient (DAC)
