Aligning Noisy Parallel Corpora Across Language Groups: Word Pair Feature Matching by Dynamic Time Warping
Pascale Fung (Columbia University), Kathleen McKeown (Columbia University)

TL;DR
This paper introduces DK-vec, an algorithm based on dynamic time warping for aligning noisy Asian/Indo-European parallel texts. It matches word pairs using frequency, position, and recency features, and does not rely on sentence boundaries.
Contribution
The paper presents DK-vec, an algorithm that aligns noisy parallel corpora by matching word pairs with dynamic time warping over frequency, position, and recency features, without requiring sentence boundary information.
Findings
DK-vec improves on previous alignment algorithms by better handling the non-linear nature of noisy corpora
Produces a small bilingual lexicon whose entries serve as anchor points for alignment
Works on Asian/Indo-European text pairs without sentence boundaries
Abstract
We propose a new algorithm called DK-vec for aligning pairs of Asian/Indo-European noisy parallel texts without sentence boundaries. DK-vec improves on previous alignment algorithms in that it handles better the non-linear nature of noisy corpora. The algorithm uses frequency, position and recency information as features for pattern matching. Dynamic Time Warping is used as the matching technique between word pairs. This algorithm produces a small bilingual lexicon which provides anchor points for alignment.
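The matching step described in the abstract can be illustrated with a minimal sketch. It assumes that a word's position feature is the list of token offsets at which it occurs, that its recency feature is the gap between successive occurrences, and that two words are compared by a textbook dynamic time warping over their recency vectors with absolute difference as the local cost. The function names, the local cost, and the unconstrained warping path are all illustrative assumptions; the summary above does not give the paper's exact formulation.

```python
from typing import List, Sequence


def position_vector(tokens: Sequence[str], word: str) -> List[int]:
    """Token offsets at which `word` occurs (assumed position feature)."""
    return [i for i, tok in enumerate(tokens) if tok == word]


def recency_vector(positions: Sequence[int]) -> List[int]:
    """Gaps between successive occurrences (assumed recency feature)."""
    return [b - a for a, b in zip(positions, positions[1:])]


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Textbook DTW cost between two sequences, using |x - y| as local cost.

    Returns infinity if either sequence is empty.
    """
    if not a or not b:
        return float("inf")
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest warping cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # advance in a only
                cost[i][j - 1],      # advance in b only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]


# Toy example: a hypothetical source word and candidate translation.
src_tokens = "a x b a c a".split()
tgt_tokens = "y A z A w w A".split()
src_recency = recency_vector(position_vector(src_tokens, "a"))
tgt_recency = recency_vector(position_vector(tgt_tokens, "A"))
score = dtw_distance(src_recency, tgt_recency)
```

Under these assumptions, a lower score means the two words recur with a more similar rhythm across their texts, so low-scoring pairs are candidates for the bilingual lexicon that supplies anchor points. Because the warping path can stretch or compress either sequence, an extra or missing occurrence in one text does not by itself rule out a match, which is how this kind of matching tolerates non-linear noise.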
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Advanced Text Analysis Techniques
