Noisy Parallel Data Alignment

Ruoyu Xie; Antonios Anastasopoulos

arXiv:2301.09685 · cs.CL · February 13, 2023

TL;DR

This paper investigates how robust word alignment models are on noisy OCR data for under-resourced languages, and proposes methods that significantly improve alignment accuracy under such conditions.

Contribution

The paper introduces a noise simulation and structural biasing approach that makes neural word alignment models more robust to noisy OCR output.
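To make the idea of noise simulation concrete, here is a minimal sketch of injecting character-level noise into clean text. It is not the authors' procedure: the function name `add_char_noise`, the three edit operations, the uniform choice among them, the `rate` parameter, and the Latin `alphabet` are all assumptions made for illustration.

```python
import random


def add_char_noise(text, rate=0.1, seed=0, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Corrupt roughly `rate` of the non-space characters in `text`.

    Hypothetical helper: each selected character is substituted, deleted,
    or followed by an inserted character, loosely imitating OCR errors.
    Whitespace is never touched, so the number of spaces stays the same.
    """
    rng = random.Random(seed)  # local generator keeps the output reproducible
    out = []
    for ch in text:
        # Leave whitespace and unselected characters unchanged.
        if ch.isspace() or rng.random() >= rate:
            out.append(ch)
            continue
        op = rng.choice(("substitute", "delete", "insert"))
        if op == "substitute":
            out.append(rng.choice(alphabet))
        elif op == "insert":
            out.append(ch)
            out.append(rng.choice(alphabet))
        # "delete": append nothing, dropping the character.
    return "".join(out)


if __name__ == "__main__":
    print(add_char_noise("the quick brown fox", rate=0.3, seed=1))
```

Noisy text produced this way could be paired with the clean side of a parallel corpus to expose an aligner to OCR-like errors during training; a realistic setup would likely estimate the error distribution from actual OCR output rather than sample edits uniformly.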

Findings

01

Reduced alignment error rate by up to 59.6% on multiple language pairs.

02

Demonstrated improved performance of alignment models under noisy conditions.

03

Provided insights into the impact of noise on word alignment accuracy.
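For readers unfamiliar with the metric behind the first finding, alignment error rate (AER) is conventionally computed from a predicted link set A, gold "sure" links S, and gold "possible" links P (with S contained in P) as AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|). The sketch below implements that standard definition; the function name and the representation of links as `(source_index, target_index)` tuples are assumptions, and the listing does not say exactly how the paper's evaluation was set up.

```python
def alignment_error_rate(predicted, sure, possible=None):
    """Standard AER: 1 - (|A & S| + |A & P|) / (|A| + |S|).

    predicted, sure, possible: iterables of (source_index, target_index) links.
    Sure links are always counted as possible; if `possible` is omitted,
    P is taken to be S. Lower is better, and 0.0 means a perfect match.
    """
    a = set(predicted)
    s = set(sure)
    p = s if possible is None else (set(possible) | s)
    denom = len(a) + len(s)
    if denom == 0:
        return 0.0  # nothing predicted and nothing expected
    return 1.0 - (len(a & s) + len(a & p)) / denom


if __name__ == "__main__":
    predicted = {(0, 0), (1, 1), (2, 2)}
    sure = {(0, 0), (1, 1)}
    possible = {(2, 3)}
    print(alignment_error_rate(predicted, sure, possible))
```

In the example, two of the three predicted links match sure links and the third matches nothing, so the numerator is 2 + 2 = 4 and the denominator is 3 + 2 = 5.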

Abstract

An ongoing challenge in current natural language processing is how its major advancements tend to disproportionately favor resource-rich languages, leaving a significant number of under-resourced languages behind. Due to the lack of resources required to train and evaluate models, most modern language technologies are either nonexistent or unreliable to process endangered, local, and non-standardized languages. Optical character recognition (OCR) is often used to convert endangered language documents into machine-readable data. However, such OCR output is typically noisy, and most word alignment models are not built to work under such noisy conditions. In this work, we study the existing word-level alignment models under noisy settings and aim to make them more robust to noisy data. Our noise simulation and structural biasing method, tested on multiple language pairs, manages to reduce…

Code & Models

Repositories

ruoyuxie/noisy_parallel_data_alignment
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Handwritten Text Recognition Techniques