Human-in-the-Loop Synthetic Text Data Inspection with Provenance Tracking
Hong Jin Kang, Fabrice Harel-Canada, Muhammad Ali Gulzar, Violet Peng, Miryung Kim

TL;DR
This paper introduces INSPECTOR, a human-in-the-loop tool that combines provenance tracking and assistive labeling to help users inspect the quality of synthetic texts, significantly increasing the number of texts they correctly label.
Contribution
The paper presents INSPECTOR, a novel system that integrates provenance tracking with assistive labeling to make the inspection of synthetic text data more efficient.
Findings
INSPECTOR triples the number of correctly labeled texts in user studies.
Grouping texts by transformation provenance is most useful for inspection.
No single technique completely replaces human effort in data quality analysis.
Abstract
Data augmentation techniques apply transformations to existing texts to generate additional data. The transformations may produce low-quality texts, where the meaning of the text is changed and the text may even be mangled beyond human comprehension. Analyzing the synthetically generated texts and their corresponding labels is slow and demanding. To winnow out texts with incorrect labels, we develop INSPECTOR, a human-in-the-loop data inspection technique. INSPECTOR combines the strengths of provenance tracking techniques with assistive labeling. INSPECTOR allows users to group related texts by their transformation provenance, i.e., the transformations applied to the original text, or feature provenance, the linguistic features of the original text. For assistive labeling, INSPECTOR computes metrics that approximate data quality, and allows users to compare the corresponding label of…
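The grouping by transformation provenance described in the abstract can be illustrated with a minimal Python sketch. This is not INSPECTOR's implementation: the record layout, the transformation names (synonym_swap, negation_insert, typo_inject), and the function group_by_transformation are all hypothetical, chosen only to show how texts that share the same sequence of transformations could be gathered so that a reviewer inspects them together.

```python
from collections import defaultdict

# Hypothetical records: each synthetic text carries the ordered names of the
# transformations applied to its original text (its transformation provenance).
records = [
    {"text": "The movie was great!", "label": "positive",
     "transforms": ("synonym_swap",)},
    {"text": "The film was not great!", "label": "positive",
     "transforms": ("synonym_swap", "negation_insert")},
    {"text": "Teh moive was graet!", "label": "positive",
     "transforms": ("typo_inject",)},
    {"text": "A film that was not good.", "label": "positive",
     "transforms": ("synonym_swap", "negation_insert")},
]


def group_by_transformation(records):
    """Group records that share the same ordered sequence of transformations."""
    groups = defaultdict(list)
    for record in records:
        groups[record["transforms"]].append(record)
    return dict(groups)


groups = group_by_transformation(records)

# Show each provenance group with its size so a reviewer can pick one to inspect.
for transforms, members in groups.items():
    print(" -> ".join(transforms), "|", len(members), "text(s)")
```

In this toy data, the two texts produced by a synonym swap followed by a negation insertion land in one group, so a reviewer who notices that the negation flips the sentiment can mark the whole group's labels as suspect at once instead of reading each text separately.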
Taxonomy
Topics: Scientific Computing and Data Management · Data Quality and Management · Digital and Cyber Forensics
