Backtranslation and paraphrasing in the LLM era? Comparing data augmentation methods for emotion classification

Łukasz Radliński; Mateusz Guściora; Jan Kocoń

arXiv:2507.14590·cs.CL·July 22, 2025

TL;DR

This study compares traditional data augmentation methods, such as backtranslation and paraphrasing, with generative approaches based on large language models for emotion classification, and finds that the traditional methods often perform as well as or better than the generative ones.

Contribution

The paper systematically evaluates traditional and generative data augmentation methods for NLP, highlighting the effectiveness of backtranslation and paraphrasing in the LLM era.

Findings

01

Backtranslation and paraphrasing perform comparably to or better than generative methods.

02

Traditional augmentation methods are effective for emotion classification.

03

Generative methods do not always outperform traditional approaches.

Abstract

Numerous domain-specific machine learning tasks struggle with data scarcity and class imbalance. This paper systematically explores data augmentation methods for NLP, particularly through large language models like GPT. The purpose of this paper is to examine and evaluate whether traditional methods such as paraphrasing and backtranslation can leverage a new generation of models to achieve comparable performance to purely generative methods. Methods aimed at solving the problem of data scarcity and utilizing ChatGPT were chosen, as well as an exemplary dataset. We conducted a series of experiments comparing four different approaches to data augmentation in multiple experimental setups. We then evaluated the results both in terms of the quality of generated data and its impact on classification performance. The key findings indicate that backtranslation and paraphrasing can yield…
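To make the class-imbalance setting in the abstract concrete, the following is a minimal sketch of rewrite-based augmentation: under-represented emotion labels are topped up with rewritten copies of their existing examples. It is not the authors' pipeline, which this summary does not describe. The names `augment_minority_classes` and `toy_rewrite` are hypothetical, and `toy_rewrite` is only a placeholder for a real backtranslation or paraphrasing step (for example, a call to a translation model or an LLM).

```python
import random
from collections import Counter
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (text, emotion_label)


def augment_minority_classes(
    data: List[Example],
    rewrite: Callable[[str], str],
    seed: int = 0,
) -> List[Example]:
    """Top up every label to the size of the largest class.

    Each added example is a rewritten copy of a randomly chosen
    existing example with the same label. `rewrite` is an assumption:
    any function that maps a text to a label-preserving variant.
    """
    if not data:
        return []
    rng = random.Random(seed)
    counts = Counter(label for _, label in data)
    target = max(counts.values())

    augmented = list(data)  # keep the original examples untouched
    for label, n in counts.items():
        pool = [text for text, lab in data if lab == label]
        for _ in range(target - n):
            source = rng.choice(pool)
            augmented.append((rewrite(source), label))
    return augmented


def toy_rewrite(text: str) -> str:
    # Placeholder only: a real setup would backtranslate or paraphrase here.
    return text + " (rewritten)"


if __name__ == "__main__":
    data = [
        ("I am so happy", "joy"),
        ("What a great day", "joy"),
        ("Best news ever", "joy"),
        ("I am furious", "anger"),
    ]
    balanced = augment_minority_classes(data, toy_rewrite)
    print(Counter(label for _, label in balanced))
```

Passing the rewriting step in as a function keeps the balancing logic identical across augmentation methods, so backtranslation, paraphrasing, and purely generative variants could be swapped in and compared under the same conditions.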
