An Algorithm for Aligning Sentences in Bilingual Corpora Using Lexical Information

Akshar Bharati; V.Sriram; A.Vamshi Krishna; Rajeev Sangal; S.M.Bendre

arXiv:cs/0302014·cs.CL·May 23, 2007·6 cites

Open Access

TL;DR

This paper introduces a language-independent algorithm that aligns sentences in bilingual corpora using lexical information rather than sentence lengths, and that does better where length-based statistical methods give poor results.

Contribution

The proposed algorithm aligns sentences using lexical information and heuristics instead of sentence-length statistics, and it outperforms statistical methods in the cases where those methods do not give good results.

Findings

01 Comparable results with existing algorithms in most cases

02 Better performance in cases where statistical algorithms fail

03 Language independence of the alignment method

Abstract

In this paper we describe an algorithm for aligning sentences with their translations in a bilingual corpus using lexical information about the languages. Existing efficient algorithms ignore word identities and consider only the sentence lengths (Brown, 1991; Gale and Church, 1993). For a sentence in the source language text, the proposed algorithm picks the most likely translation from the target language text using lexical information and certain heuristics. It does not do statistical analysis using sentence lengths. The algorithm is language independent. It also aids in detecting addition and deletion of text in translations. The algorithm gives results comparable to those of the existing algorithms in most cases, and it does better in cases where statistical algorithms do not give good results.
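
The idea of picking the most likely translation from lexical evidence can be illustrated with a small Python sketch. It is not the authors' algorithm: the abstract does not specify the scoring function or the heuristics, so the dictionary-overlap score, the forward search window, the threshold, and the names `lexical_score` and `align` are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Set, Tuple

Sentence = List[str]


def lexical_score(src: Sentence, tgt: Sentence, lexicon: Dict[str, Set[str]]) -> float:
    """Fraction of source words with a dictionary translation in the target sentence.

    Assumed scoring function; the paper's actual measure is not given in the abstract.
    """
    if not src:
        return 0.0
    tgt_words = set(tgt)
    matched = sum(1 for word in src if lexicon.get(word, set()) & tgt_words)
    return matched / len(src)


def align(
    source: List[Sentence],
    target: List[Sentence],
    lexicon: Dict[str, Set[str]],
    window: int = 2,        # hypothetical: how far ahead to look in the target text
    threshold: float = 0.3  # hypothetical: minimum score to accept a match
) -> List[Tuple[int, Optional[int]]]:
    """Greedily pair each source sentence with its best-scoring nearby target sentence.

    Returns (source_index, target_index) pairs; target_index is None when no
    candidate reaches the threshold, which hints at text deleted in translation.
    """
    alignments: List[Tuple[int, Optional[int]]] = []
    cursor = 0  # first target sentence not yet consumed (assumes monotone order)
    for i, src in enumerate(source):
        best_j: Optional[int] = None
        best_score = 0.0
        for j in range(cursor, min(len(target), cursor + window + 1)):
            score = lexical_score(src, target[j], lexicon)
            if score > best_score:
                best_j, best_score = j, score
        if best_j is not None and best_score >= threshold:
            alignments.append((i, best_j))
            cursor = best_j + 1
        else:
            alignments.append((i, None))
    return alignments
```

In this sketch no sentence lengths are consulted; only word identities and a bilingual lexicon drive the decision. Source sentences paired with `None` are candidates for deletions, and target indices that never appear in the output are candidates for additions.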

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems