Position: Early-Stage Quality Assurance in Annotation Pipelines Is More Cost-Effective Than Late-Stage Validation
Sunil Kothari, Sumukha Sharma Thoppanahalli Chandramouli, Naman Khandelwal, Parth Kulshreshtha, Ashi Jain, Kriti Banka, Tanuja Chintada, Venkata Triveni, Gulipalli Praveen Kumar, Manish Mehta, Tao Liu

TL;DR
Early-stage quality assurance in annotation pipelines can substantially reduce costs and errors, yet the machine learning community largely overlooks validation timing, a key factor in effective data quality management.
Contribution
This paper introduces a taxonomy of QA trigger points, formalizes the impact of validation timing on error rates and costs, and highlights the community's neglect of this critical factor.
Findings
Only 4% of surveyed papers report validation timing.
Early QA reduces error propagation and costs significantly.
Empirical evidence from adjacent fields supports early validation benefits.
Abstract
This position paper argues that the machine learning community should prioritize early-stage quality assurance in annotation pipelines over the prevailing practice of late-stage validation. Data quality bottlenecks increasingly limit foundation model improvement, yet quality assurance research focuses almost exclusively on validation methods rather than validation timing. The point at which validation occurs, not merely which methods are employed, fundamentally determines both error rates and annotation costs. This temporal neglect is puzzling given the well-established "shift-left" principle from software engineering, where empirical studies demonstrate 4x to 100x cost multipliers for defects detected in later stages (Boehm, 1981; Shull et al., 2002). Annotation pipelines exhibit analogous dynamics: errors caught before annotation begins cost a fraction of those discovered after review cycles complete.…
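To make the cost argument concrete, the following is a minimal arithmetic sketch, not the paper's formalization. The stage names, the per-stage error counts, and the mapping of the 1x, 4x, and 100x multipliers onto annotation stages are all hypothetical. Only the 4x to 100x range comes from the software engineering studies cited in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    cost_multiplier: float  # cost to fix one error caught here, relative to the earliest stage


# Hypothetical stages. The multipliers borrow the endpoints of the 4x to 100x
# range cited for software defects; they are NOT measured annotation costs.
STAGES = (
    Stage("pre-annotation", 1.0),
    Stage("during-annotation", 4.0),
    Stage("post-review", 100.0),
)


def total_fix_cost(errors_caught: dict[str, int], base_cost: float = 1.0) -> float:
    """Sum the cost of fixing errors, weighted by the stage where each was caught."""
    multipliers = {stage.name: stage.cost_multiplier for stage in STAGES}
    return sum(
        base_cost * multipliers[name] * count
        for name, count in errors_caught.items()
    )


# Two hypothetical pipelines with the same 100 errors, caught at different times.
early_heavy = {"pre-annotation": 80, "during-annotation": 15, "post-review": 5}
late_heavy = {"pre-annotation": 5, "during-annotation": 15, "post-review": 80}

early_cost = total_fix_cost(early_heavy)
late_cost = total_fix_cost(late_heavy)

print(f"early-heavy QA cost: {early_cost}")
print(f"late-heavy QA cost:  {late_cost}")
```

Under these assumed numbers, the same set of errors costs far more to fix when most are caught after review. The actual ratio depends entirely on the multipliers chosen, which would need to be measured for a real annotation pipeline.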
