A Sentence Meaning Based Alignment Method for Parallel Text Corpora Preparation

Krzysztof Wołk; Krzysztof Marasek

arXiv:1509.09093·cs.CL·October 1, 2015

TL;DR

This paper introduces a language-independent sentence alignment method based on semantic heuristics, improving bilingual text preparation for machine translation without relying on positional cues, and demonstrates its effectiveness on TED Talks data.

Contribution

It presents a novel, language-independent sentence alignment approach that incorporates semantic heuristics and improves MT system performance.

Findings

01. Effective alignment on TED Talks corpus

02. Improved MT scores using the aligned data

03. Comparable or better than existing alignment methods

Abstract

Text alignment is crucial to the accuracy of Machine Translation (MT) systems, some NLP tools, and any other text processing task that requires bilingual data. This research proposes a language-independent sentence alignment approach based on Polish-to-English experiments (Polish is not a position-sensitive language). The approach was developed on the TED Talks corpus but can be used for any text domain or language pair. It implements various heuristics for sentence recognition, some of which use synonyms and semantic text structure analysis as additional information. Care was taken to minimize data loss. The solution is compared to other sentence alignment implementations, and an improvement in MT system score on text processed with the described tool is also shown.
