SMOTExT: SMOTE meets Large Language Models

Mateusz Bystroński; Mikołaj Hołysz; Grzegorz Piotrowski; Nitesh V. Chawla; Tomasz Kajdanowicz

arXiv:2505.13434 · cs.CL · May 20, 2025

Open Access · 1 Repo

TL;DR

SMOTExT introduces a novel method combining SMOTE with large language models to generate synthetic text data, addressing class imbalance and data scarcity in NLP, with promising initial results for privacy and low-resource settings.

Contribution

The paper presents SMOTExT, a new approach that adapts SMOTE for textual data using BERT embeddings and xRAG decoding, enabling effective data augmentation and privacy-preserving learning.

Findings

01. Models trained on the generated synthetic data can achieve performance comparable to models trained on real data.

02. Preliminary results indicate potential for privacy-preserving NLP.

03. The method shows promise for knowledge distillation in low-resource scenarios.

Abstract

Data scarcity and class imbalance are persistent challenges in training robust NLP models, especially in specialized domains or low-resource settings. We propose a novel technique, SMOTExT, that adapts the idea of the Synthetic Minority Over-sampling Technique (SMOTE) to textual data. Our method generates new synthetic examples by interpolating between BERT-based embeddings of two existing examples and then decoding the resulting latent point into text with the xRAG architecture. By leveraging xRAG's cross-modal retrieval-generation framework, we can effectively turn interpolated vectors into coherent text. While this is preliminary work supported by qualitative outputs only, the method shows strong potential for knowledge distillation and data augmentation in few-shot settings. Notably, our approach also shows promise for privacy-preserving machine learning: in early experiments, training models solely…
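
The interpolation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the embeddings are already available as a NumPy array (random vectors stand in for BERT embeddings here), and it picks the second example among the k nearest neighbors, as classic SMOTE does; the abstract only says that two existing examples are interpolated. The function name `smote_interpolate` and its parameters are hypothetical, and the xRAG decoding step is not covered.

```python
import numpy as np

def smote_interpolate(embeddings, n_new, k=5, seed=0):
    """Return n_new synthetic vectors, each lying on the line segment
    between one embedding and one of its k nearest neighbors.

    Assumption: neighbor-based pairing as in classic SMOTE; the paper's
    exact pairing rule is not stated in the abstract.
    """
    rng = np.random.default_rng(seed)
    embeddings = np.asarray(embeddings, dtype=float)
    n = len(embeddings)
    if n < 2:
        raise ValueError("need at least two embeddings to interpolate")
    k = min(k, n - 1)

    # Pairwise squared Euclidean distances; a point is never its own neighbor.
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.einsum("ijk,ijk->ij", diffs, diffs)
    np.fill_diagonal(dists, np.inf)
    neighbors = np.argsort(dists, axis=1)[:, :k]

    # Pick a base example, then one of its k nearest neighbors.
    base = rng.integers(0, n, size=n_new)
    picked = neighbors[base, rng.integers(0, k, size=n_new)]

    # Interpolate: x_new = x_i + lam * (x_j - x_i), with lam in [0, 1).
    lam = rng.random((n_new, 1))
    return embeddings[base] + lam * (embeddings[picked] - embeddings[base])

# Stand-in for BERT embeddings of 10 minority-class texts (8 dimensions).
X = np.random.default_rng(1).normal(size=(10, 8))
synthetic = smote_interpolate(X, n_new=4, k=3)
```

Each row of `synthetic` is a latent point that a decoder would then have to turn back into text; in the paper that role is played by xRAG.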

Code & Models

Repositories

MindSpore-scientific-2/code-8/tree/main/SMOTE
mindspore

Taxonomy

Topics: Topic Modeling · Computational and Text Analysis Methods · Domain Adaptation and Few-Shot Learning

Methods: Knowledge Distillation