Neural Machine Translation Data Generation and Augmentation using ChatGPT

Wayne Yang; Garrett Nicolai

arXiv:2307.05779 · cs.CL · July 13, 2023 · 2 cites

Open Access

TL;DR

This paper explores using ChatGPT to generate hallucinated parallel data for neural machine translation, demonstrating that such synthetic data can enhance translation quality despite limited diversity.

Contribution

The paper uses ChatGPT to generate synthetic parallel data for machine translation and shows that this data can supplement small, manually procured corpora.

Findings

01. Hallucinated data improves the translation signal.

02. Synthetic data helps even when its domain does not match the original dataset.

03. Generated data improves performance despite its limited diversity.

Abstract

Neural models have revolutionized the field of machine translation, but creating parallel corpora is expensive and time-consuming. We investigate an alternative to manual parallel corpora: hallucinated parallel corpora created by generative language models. Although these models are themselves trained on parallel data, they can leverage a multilingual vector space to create data, and may be able to supplement small manually procured corpora. Our experiments highlight two key findings: despite a lack of diversity in their output, the hallucinated data improves the translation signal, even when the domain clashes with the original dataset.

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification