inversedMixup: Data Augmentation via Inverting Mixed Embeddings

Fanshuang Kong; Richong Zhang; Qiyu Sun; Zhijie Nie; Ting Deng; Chunming Hu

arXiv:2601.21543·cs.CL·February 9, 2026

Open Access

TL;DR

inversedMixup is a data augmentation method that reconstructs human-readable sentences from mixed embeddings. It aligns the output embedding space of a task-specific model with the input embedding space of a large language model, combining Mixup's controllable mixing ratio with the interpretability of LLM-generated text.

Contribution

The paper introduces a three-stage training procedure for embedding alignment, which enables interpretable mixed-sentence generation and addresses manifold intrusion in text Mixup.

Findings

01 Effective in few-shot and fully supervised settings

02 First empirical evidence of manifold intrusion in text Mixup

03 Improves augmentation quality and interpretability

Abstract

Mixup generates augmented samples by linearly interpolating inputs and labels with a controllable ratio. However, since it operates at the latent embedding level, the resulting samples are not human-interpretable. In contrast, LLM-based augmentation methods produce sentences via prompts at the token level, yielding readable outputs but offering limited control over the generation process. Inspired by recent advances in LLM inversion, which reconstructs natural language from embeddings and helps bridge the gap between the latent embedding space and the discrete token space, we propose inversedMixup, a unified framework that combines the controllability of Mixup with the interpretability of LLM-based generation. Specifically, inversedMixup adopts a three-stage training procedure to align the output embedding space of a task-specific model with the input embedding space of an LLM. Upon successful…
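The interpolation step described in the first sentence of the abstract can be illustrated with a minimal sketch. This shows only the standard Mixup formulation (a convex combination of two inputs and their labels with ratio `lam`), not the paper's three-stage alignment or the inversion step, whose details are not given here. The function name `mixup`, the toy embeddings, and the one-hot labels are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def mixup(
    x_a: Sequence[float],
    x_b: Sequence[float],
    y_a: Sequence[float],
    y_b: Sequence[float],
    lam: float,
) -> Tuple[List[float], List[float]]:
    """Linearly interpolate two embeddings and their label vectors.

    Hypothetical helper: lam is the mixing ratio, so lam=1.0 returns
    sample A unchanged and lam=0.0 returns sample B unchanged.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(x_a) != len(x_b) or len(y_a) != len(y_b):
        raise ValueError("paired inputs must have matching lengths")
    # The same ratio is applied to the inputs and to the labels.
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x_mix, y_mix


# Toy sentence embeddings and one-hot labels (assumed values, not from the paper).
emb_a, label_a = [1.0, 0.0, 2.0], [1.0, 0.0]
emb_b, label_b = [0.0, 4.0, 2.0], [0.0, 1.0]

mixed_emb, mixed_label = mixup(emb_a, emb_b, label_a, label_b, lam=0.25)
print(mixed_emb, mixed_label)
```

The mixed vector produced this way lives only in embedding space, which is the interpretability gap the abstract points to: there is no sentence attached to `mixed_emb` unless something maps it back to tokens.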

Taxonomy

Topics: Topic Modeling · Generative Adversarial Networks and Image Synthesis · Natural Language Processing Techniques