The Human Factor in Data Cleaning: Exploring Preferences and Biases
Hazim AbdElazim, Shadman Islam, Mostafa Milani

TL;DR
This study shows that human data cleaning is subject to cognitive biases, including framing, anchoring, and the representativeness heuristic. These biases affect error detection and repair decisions, which has implications for the design of human-in-the-loop systems.
Contribution
It provides empirical evidence of cognitive biases in human data cleaning tasks and suggests design principles for more effective human-in-the-loop data cleaning systems.
Findings
Biases affect error detection and correction decisions.
Surface formatting influences false error flags.
People prefer omission over imputation in repairs.
Abstract
Data cleaning is often framed as a technical preprocessing step, yet in practice it relies heavily on human judgment. We report results from a controlled survey study in which participants performed error detection, data repair and imputation, and entity matching tasks on census-inspired scenarios with known semantic validity. We find systematic evidence for several cognitive bias mechanisms in data cleaning. Framing effects arise when surface-level formatting differences (e.g., capitalization or numeric presentation) increase false-positive error flags despite unchanged semantics. Anchoring and adjustment bias appears when expert cues shift participant decisions beyond parity, consistent with salience and availability effects. We also observe the representativeness heuristic: atypical but valid attribute combinations are frequently flagged as erroneous, and in entity matching tasks,…
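To make the framing-effect setup concrete, the following minimal Python sketch checks whether two cell values differ only in surface formatting (capitalization, whitespace, or numeric presentation). It is an illustration, not the authors' instrument: the helper names `canonical` and `same_semantics`, the normalization rules, and the example values are all assumptions introduced here.

```python
from decimal import Decimal, InvalidOperation


def canonical(value: str):
    """Return a formatting-insensitive form of a cell value (hypothetical helper)."""
    text = " ".join(value.split())  # trim and collapse internal whitespace
    try:
        # Assumption: commas are thousands separators, i.e. presentation only.
        number = Decimal(text.replace(",", ""))
        if number.is_finite():
            return number  # Decimal("52000") == Decimal("52000.00")
    except InvalidOperation:
        pass  # not numeric, fall through to text handling
    return text.casefold()  # ignore capitalization


def same_semantics(a: str, b: str) -> bool:
    """True when two values differ at most in surface formatting."""
    return canonical(a) == canonical(b)


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the study.
    for left, right in [("Bachelors", "BACHELORS"),
                        ("52,000", "52000.00"),
                        ("52000", "25000")]:
        print(left, right, same_semantics(left, right))
```

Under these assumptions, a participant flag on a pair for which `same_semantics` returns `True` would count as a false positive attributable to presentation rather than content.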
Taxonomy
Topics: Data Quality and Management · Research Data Management Practices · Scientific Computing and Data Management
