Aligning a Parallel English-Chinese Corpus Statistically with Lexical Criteria
Dekai Wu (Hong Kong University of Science & Technology)

TL;DR
This paper reports on automatic sentence alignment of parallel English-Chinese texts. It tests whether Gale & Church's length-based statistical method carries over to a non-Indo-European language and presents an improved statistical method that adds domain-specific lexical cues.
Contribution
It introduces an improved statistical alignment method that combines length-based alignment with domain-specific lexical cues, and assesses its performance on a parallel English-Chinese corpus.
Findings
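Gale & Church's length-based method is applicable to alignment involving a non-Indo-European language (here, Chinese).
Domain-specific lexical cues improve alignment accuracy.
The work contributes to the construction of the HKUST English-Chinese Parallel Bilingual Corpus.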
Abstract
We describe our experience with automatic alignment of sentences in parallel English-Chinese texts. Our report concerns three related topics: (1) progress on the HKUST English-Chinese Parallel Bilingual Corpus; (2) experiments addressing the applicability of Gale & Church's length-based statistical method to the task of alignment involving a non-Indo-European language; and (3) an improved statistical method that also incorporates domain-specific lexical cues.
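To make the length-based idea concrete, the following is a minimal Python sketch in the spirit of Gale & Church's method, not the paper's implementation. It assumes sentence lengths are given as plain numbers in whatever unit the corpus uses. The names `PRIORS`, `length_cost`, `align`, and the optional `lexical_cost` hook are hypothetical, and the values of `PRIORS`, `c`, and `s2` are illustrative placeholders that would have to be re-estimated for English-Chinese data.

```python
import math

# Illustrative prior probabilities for each alignment pattern
# (number of English sentences, number of Chinese sentences).
# Placeholder values, NOT figures from the paper.
PRIORS = {(1, 1): 0.89, (1, 0): 0.005, (0, 1): 0.005,
          (2, 1): 0.045, (1, 2): 0.045, (2, 2): 0.01}


def length_cost(len_en, len_zh, c=1.0, s2=6.8):
    """Negative log-probability of the length mismatch between two segments.

    c  : assumed expected Chinese length per unit of English length
    s2 : assumed variance of that ratio (both are placeholders)
    """
    if len_en == 0 and len_zh == 0:
        return 0.0
    mean = (len_en + len_zh / c) / 2.0
    delta = (len_zh - len_en * c) / math.sqrt(mean * s2)
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|delta|)).
    p = math.erfc(abs(delta) / math.sqrt(2.0))
    return -math.log(max(p, 1e-300))


def align(en_lens, zh_lens, c=1.0, s2=6.8, lexical_cost=None):
    """Find the minimum-cost alignment by dynamic programming.

    lexical_cost is a hypothetical hook: a callable (i, ni, j, nj) -> float
    that could add a penalty or bonus based on lexical cues for the
    candidate pairing of English sentences i..ni-1 with Chinese j..nj-1.
    """
    n, m = len(en_lens), len(zh_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                step = length_cost(sum(en_lens[i:ni]), sum(zh_lens[j:nj]), c, s2)
                step -= math.log(prior)
                if lexical_cost is not None:
                    step += lexical_cost(i, ni, j, nj)
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)
    # Trace back from the end to recover the aligned groups of sentence indices.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads


if __name__ == "__main__":
    # Two English sentences whose combined length matches one Chinese sentence.
    print(align([20, 30], [50]))
```

Each returned pair lists the English and Chinese sentence indices grouped together. Passing a `lexical_cost` function is one possible way to fold lexical evidence into the same search; how the paper actually combines the two sources of evidence is not specified in the abstract.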
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Authorship Attribution and Profiling
