The Human Factor in Data Cleaning: Exploring Preferences and Biases

Hazim AbdElazim; Shadman Islam; Mostafa Milani

arXiv:2602.19368·cs.DB·March 26, 2026

Open Access

TL;DR

Human data cleaning is shaped by cognitive biases, including framing effects, anchoring, and the representativeness heuristic. These biases affect error detection and repair decisions, with implications for the design of human-in-the-loop cleaning systems.

Contribution

The paper provides empirical evidence of cognitive biases in human data cleaning tasks and suggests design principles for more effective human-in-the-loop data cleaning systems.

Findings

01

Cognitive biases systematically affect error detection and correction decisions.

02

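Surface-level formatting differences (e.g., capitalization or numeric presentation) increase false-positive error flags even when the semantics are unchanged.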

03

People prefer omission over imputation in repairs.

Abstract

Data cleaning is often framed as a technical preprocessing step, yet in practice it relies heavily on human judgment. We report results from a controlled survey study in which participants performed error detection, data repair and imputation, and entity matching tasks on census-inspired scenarios with known semantic validity. We find systematic evidence for several cognitive bias mechanisms in data cleaning. Framing effects arise when surface-level formatting differences (e.g., capitalization or numeric presentation) increase false-positive error flags despite unchanged semantics. Anchoring and adjustment bias appears when expert cues shift participant decisions beyond parity, consistent with salience and availability effects. We also observe the representativeness heuristic: atypical but valid attribute combinations are frequently flagged as erroneous, and in entity matching tasks,…
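
As an illustration of the framing effect described in the abstract, the following minimal Python sketch compares false-positive error flags on semantically valid cells across two formatting conditions. It is not the authors' analysis: the condition names, the record layout, and the toy responses are all hypothetical and chosen only to show how such a gap could be computed.

```python
from collections import defaultdict

# Hypothetical toy responses, NOT data from the paper:
# (formatting_condition, cell_is_valid, participant_flagged_error)
responses = [
    ("canonical", True, False),
    ("canonical", True, False),
    ("canonical", True, True),
    ("canonical", True, False),
    ("reformatted", True, True),
    ("reformatted", True, False),
    ("reformatted", True, True),
    ("reformatted", True, True),
]


def false_positive_rates(rows):
    """Return, per condition, the share of valid cells flagged as errors."""
    flagged = defaultdict(int)
    valid = defaultdict(int)
    for condition, is_valid, was_flagged in rows:
        if not is_valid:
            continue  # only valid cells can produce false positives
        valid[condition] += 1
        flagged[condition] += int(was_flagged)
    return {condition: flagged[condition] / valid[condition] for condition in valid}


rates = false_positive_rates(responses)
# A positive gap means the reformatted cells drew more false flags,
# even though their meaning is identical to the canonical ones.
framing_gap = rates["reformatted"] - rates["canonical"]
print(rates, framing_gap)
```

With real survey data, a gap like this would still need a significance test before being read as evidence of a framing effect.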

Taxonomy

Topics: Data Quality and Management · Research Data Management Practices · Scientific Computing and Data Management