Leveraging Variation Theory in Counterfactual Data Augmentation for Optimized Active Learning

Simret Araya Gebreegziabher; Kuangshi Ai; Zheng Zhang; Elena L. Glassman; Toby Jia-Jun Li

arXiv:2408.03819 · cs.LG · June 3, 2025

TL;DR

This paper presents a counterfactual data augmentation method, inspired by Variation Theory, that synthesizes artificial datapoints highlighting key features in order to make active learning more efficient, especially when labeled data is scarce.

Contribution

The paper introduces a neuro-symbolic pipeline that combines LLMs and rule-based models to generate synthetic data for active learning, targeting the cold start problem.

Findings

01 Significantly improves performance when little labeled data is available

02 The impact of the augmented data diminishes as the amount of annotated data increases

03 Addresses the cold start problem in active learning

Abstract

Active Learning (AL) allows models to learn interactively from user feedback. This paper introduces a counterfactual data augmentation approach to AL, particularly addressing the selection of datapoints for user querying, a pivotal concern in enhancing data efficiency. Our approach is inspired by Variation Theory, a theory of human concept learning that emphasizes the essential features of a concept by focusing on what stays the same and what changes. Instead of just querying with existing datapoints, our approach synthesizes artificial datapoints that highlight potential key similarities and differences among labels using a neuro-symbolic pipeline combining large language models (LLMs) and rule-based models. Through an experiment in the example domain of text classification, we show that our approach achieves significantly higher performance when there are fewer annotated data. As the…
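The Variation Theory idea of holding most of an example constant while changing only the part that bears on the label can be illustrated with a small sketch. This is not the authors' pipeline: a hand-written cue table and regex substitution stand in for both the rule-based model and the LLM generation step, and every name here (`Example`, `LABEL_CUES`, `generate_counterfactuals`, `augment`) is hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    text: str
    label: str


# Hypothetical symbolic rules: cue words assumed to signal each label.
# In the paper's setting such patterns would be learned, not hand-written.
LABEL_CUES = {
    "positive": ["great", "excellent"],
    "negative": ["terrible", "awful"],
}


def find_cue(text, label):
    """Return the first cue word for `label` that occurs in `text`, or None."""
    for cue in LABEL_CUES[label]:
        if re.search(rf"\b{re.escape(cue)}\b", text, flags=re.IGNORECASE):
            return cue
    return None


def generate_counterfactuals(example, target_label):
    """Vary only the label-bearing cue; keep the rest of the sentence fixed.

    A real system would ask an LLM to rewrite the text; here a plain
    substitution plays that role (an assumption of this sketch).
    """
    cue = find_cue(example.text, example.label)
    if cue is None:
        return []
    pattern = re.compile(rf"\b{re.escape(cue)}\b", flags=re.IGNORECASE)
    return [
        Example(pattern.sub(replacement, example.text, count=1), target_label)
        for replacement in LABEL_CUES[target_label]
    ]


def augment(seed, labels):
    """Return the seed examples plus counterfactuals for every other label."""
    augmented = list(seed)
    for ex in seed:
        for target in labels:
            if target != ex.label:
                augmented.extend(generate_counterfactuals(ex, target))
    return augmented


seed = [Example("The battery life is great.", "positive")]
augmented = augment(seed, list(LABEL_CUES))
for ex in augmented:
    print(ex.label, "|", ex.text)
```

Each generated datapoint shares the invariant part of its seed ("The battery life is ...") and differs only in the cue, so a small annotated seed set yields contrasting examples for the other label. That is the kind of signal the abstract describes as useful when few annotations exist.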

Taxonomy

Topics: Data Stream Mining Techniques · Neural Networks and Applications · Machine Learning and Algorithms