iFlip: Iterative Feedback-driven Counterfactual Example Refinement
Yilong Wang, Qianli Wang, Nils Feldhus

TL;DR
iFlip is an iterative method that uses feedback to improve the validity of counterfactual examples generated by large language models, enhancing explainability and data augmentation in NLP.
Contribution
This paper introduces iFlip, a novel iterative refinement approach leveraging multiple feedback types to generate more valid counterfactuals with LLMs, surpassing existing methods.
Findings
iFlip achieves on average 57.8% higher validity, measured by label flipping rate, than five state-of-the-art baselines.
Counterfactuals from iFlip improve model robustness.
A user study finds that iFlip outperforms baselines in completeness and overall satisfaction.
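The validity metric named above, the label flipping rate, can be illustrated with a minimal sketch. It assumes a counterfactual counts as valid when the classifier's predicted label differs from its prediction on the original input; the paper's exact definition may differ. `toy_predict` and `label_flip_rate` are hypothetical names introduced here for illustration, not part of iFlip.

```python
from typing import Callable, Sequence


def toy_predict(text: str) -> str:
    """Hypothetical stand-in classifier: a single keyword decides the label."""
    return "positive" if "good" in text.lower() else "negative"


def label_flip_rate(
    predict: Callable[[str], str],
    originals: Sequence[str],
    counterfactuals: Sequence[str],
) -> float:
    """Fraction of (original, counterfactual) pairs whose predicted labels differ.

    Assumption: a flip is any change in predicted label, not a match
    against a specific target label.
    """
    if len(originals) != len(counterfactuals):
        raise ValueError("originals and counterfactuals must be aligned pairs")
    if not originals:
        return 0.0
    flipped = sum(
        predict(orig) != predict(cf)
        for orig, cf in zip(originals, counterfactuals)
    )
    return flipped / len(originals)
```

Calling `label_flip_rate(toy_predict, originals, counterfactuals)` on two aligned lists of strings returns a value between 0 and 1. In practice `predict` would wrap the model being explained rather than a keyword rule.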
Abstract
Counterfactual examples are minimal edits to an input that alter a model's prediction. They are widely employed in explainable AI to probe model behavior and in natural language processing (NLP) to augment training data. However, generating valid counterfactuals with large language models (LLMs) remains challenging, as existing single-pass methods often fail to induce reliable label changes, neglecting LLMs' self-correction capabilities. To explore this untapped potential, we propose iFlip, an iterative refinement approach that leverages three types of feedback, including model confidence, feature attribution, and natural language. Our results show that iFlip achieves an average 57.8% higher validity than the five state-of-the-art baselines, as measured by the label flipping rate. The user study further corroborates that iFlip outperforms baselines in completeness, overall satisfaction,…
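The general shape of an iterative, feedback-driven refinement loop can be sketched as follows. This is not the authors' implementation: the classifier, the attribution function, and the editor are toy stand-ins (all names hypothetical), the editor replaces what would be an LLM call, and only two of the three feedback types (model confidence and feature attribution) are modeled. Natural language feedback is omitted.

```python
from typing import Dict, List, Tuple

# Toy lexicon (assumption): positive cue word -> negative replacement.
POSITIVE = {"great": "terrible", "fun": "dull", "good": "bad"}


def toy_confidence(text: str) -> Dict[str, float]:
    """Toy classifier: the share of positive cue words sets the probability."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in POSITIVE.values() for w in words)
    total = pos + neg
    p_pos = 0.5 if total == 0 else pos / total
    return {"positive": p_pos, "negative": 1.0 - p_pos}


def toy_attribution(text: str) -> List[str]:
    """Toy feature attribution: words that push the label toward 'positive'."""
    return [w for w in text.split() if w.lower() in POSITIVE]


def toy_editor(text: str, feedback: Dict[str, object]) -> str:
    """Stand-in for an LLM rewrite: replace the first attributed word."""
    important = feedback["important_words"]
    if not important:
        return text
    target = important[0]
    return text.replace(target, POSITIVE[target.lower()], 1)


def refine(text: str, target_label: str, max_iters: int = 5) -> Tuple[str, bool, List[str]]:
    """Edit the text until the toy classifier predicts target_label or the budget runs out."""
    current, history = text, [text]
    for _ in range(max_iters):
        probs = toy_confidence(current)
        if max(probs, key=probs.get) == target_label:
            break  # label already flipped; stop early to keep edits minimal
        feedback = {
            "target_confidence": probs[target_label],       # confidence feedback
            "important_words": toy_attribution(current),    # attribution feedback
        }
        current = toy_editor(current, feedback)
        history.append(current)
    probs = toy_confidence(current)
    return current, max(probs, key=probs.get) == target_label, history
```

`refine("a great and fun movie", "negative")` returns the final text, a flag saying whether the label flipped, and the list of intermediate drafts. Stopping as soon as the label flips reflects the "minimal edits" requirement stated in the abstract; a single-pass method corresponds to `max_iters=1`.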
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Machine Learning and Data Classification
