Targeted Data Generation: Finding and Fixing Model Weaknesses

Zexue He; Marco Tulio Ribeiro; Fereshte Khani

arXiv:2305.17804 · cs.CL · May 30, 2023 · 1 citation

Open Access

TL;DR

This paper introduces Targeted Data Generation (TDG), a framework that automatically identifies challenging data subgroups and uses large language models, with a human in the loop, to generate new data for them, improving accuracy on those subgroups without hurting overall performance.

Contribution

TDG automatically finds difficult subgroups and generates data for them. For each subgroup it estimates the expected benefit and potential harm of augmentation, then selects the subgroups most likely to improve within-group performance without hurting overall performance.

Findings

01. Significantly improves accuracy on challenging subgroups.
02. Enhances overall test accuracy.
03. Effective in sentiment analysis and natural language inference.

Abstract

Even when aggregate accuracy is high, state-of-the-art NLP models often fail systematically on specific subgroups of data, resulting in unfair outcomes and eroding user trust. Additional data collection may not help in addressing these weaknesses, as such challenging subgroups may be unknown to users and underrepresented in both the existing and the new data. We propose Targeted Data Generation (TDG), a framework that automatically identifies challenging subgroups and generates new data for those subgroups using large language models (LLMs) with a human in the loop. TDG estimates the expected benefit and potential harm of data augmentation for each subgroup, and selects the ones most likely to improve within-group performance without hurting overall performance. In our experiments, TDG significantly improves the accuracy on challenging subgroups for state-of-the-art sentiment analysis and…
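The selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The `SubgroupEstimate` class, the `select_subgroups` function, the `max_harm` and `top_k` parameters, and all numbers below are hypothetical, and the benefit and harm values are assumed to have been estimated already by some earlier step. The sketch only shows one simple way to keep subgroups that are below aggregate accuracy, are expected to benefit from augmentation, and are not expected to hurt overall performance.

```python
from dataclasses import dataclass


@dataclass
class SubgroupEstimate:
    """Hypothetical container for per-subgroup estimates (illustrative only)."""
    name: str
    accuracy: float  # current accuracy on the subgroup
    benefit: float   # estimated within-group accuracy gain from augmentation
    harm: float      # estimated drop in overall accuracy from augmentation


def select_subgroups(estimates, overall_accuracy, max_harm=0.0, top_k=2):
    """Pick challenging subgroups whose augmentation is expected to help.

    Assumed rule (not taken from the paper): keep a subgroup if it is below
    aggregate accuracy, has positive estimated benefit, and has estimated
    harm no larger than max_harm; then rank by estimated benefit.
    """
    candidates = [
        e for e in estimates
        if e.accuracy < overall_accuracy  # challenging: below aggregate accuracy
        and e.benefit > 0                 # expected to improve within the group
        and e.harm <= max_harm            # not expected to hurt overall accuracy
    ]
    candidates.sort(key=lambda e: e.benefit, reverse=True)
    return [e.name for e in candidates[:top_k]]


# Invented numbers, for illustration only.
estimates = [
    SubgroupEstimate("negation", accuracy=0.70, benefit=0.08, harm=0.00),
    SubgroupEstimate("sarcasm", accuracy=0.62, benefit=0.05, harm=0.03),
    SubgroupEstimate("short reviews", accuracy=0.95, benefit=0.01, harm=0.00),
    SubgroupEstimate("comparatives", accuracy=0.75, benefit=0.04, harm=0.00),
]

selected = select_subgroups(estimates, overall_accuracy=0.90)
print(selected)
```

In this toy setup, "sarcasm" is dropped because its estimated harm exceeds the tolerance, and "short reviews" is dropped because the model already does well on it. Raising `max_harm` trades some overall accuracy for wider subgroup coverage.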

Taxonomy

Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Natural Language Processing Techniques