From Pre-labeling to Production: Engineering Lessons from a Machine Learning Pipeline in the Public Sector
Ronivaldo Ferreira, Guilherme da Silva, Carla Rocha, Gustavo Pinto

TL;DR
This paper examines the challenges and engineering lessons of deploying machine learning systems in the public sector, emphasizing the importance of disciplined data governance and institutional infrastructure for trust and sustainability.
Contribution
The paper documents practical engineering strategies and organizational barriers in public-sector ML deployment, and argues for transparent, reproducible, and accountable data infrastructures.
Findings
Pre-labeling with LLMs speeds development but puts traceability at risk.
Splitting models into routed classifiers improves modularity.
Synthetic data generation introduces cost and reliability concerns.
Abstract
Machine learning is increasingly being embedded into government digital platforms, but public-sector constraints make it difficult to build ML systems that are accurate, auditable, and operationally sustainable. In practice, teams face not only technical issues like extreme class imbalance and data drift, but also organizational barriers such as bureaucratic data access, lack of versioned datasets, and incomplete governance over provenance and monitoring. Our study of the Brasil Participativo (BP) platform shows that common engineering choices -- like using LLMs for pre-labeling, splitting models into routed classifiers, and generating synthetic data -- can speed development but also introduce new traceability, reliability, and cost risks if not paired with disciplined data governance and human validation. This means that, in the public sector, responsible ML is not just a modeling…
Taxonomy
Topics: Big Data and Digital Economy · Ethics and Social Impacts of AI · Big Data and Business Intelligence
