From Pre-labeling to Production: Engineering Lessons from a Machine Learning Pipeline in the Public Sector

Ronivaldo Ferreira; Guilherme da Silva; Carla Rocha; Gustavo Pinto

arXiv:2511.01545·cs.SE·November 4, 2025

Open Access

TL;DR

This paper examines the challenges and engineering lessons of deploying machine learning systems in the public sector, emphasizing the importance of disciplined data governance and institutional infrastructure for trust and sustainability.

Contribution

The paper identifies practical engineering strategies and organizational barriers in public-sector ML deployment, and argues that such systems need transparent, reproducible, and accountable data infrastructures.

Findings

01. Pre-labeling with LLMs speeds development but puts traceability at risk.

02. Splitting models into routed classifiers improves modularity.

03. Synthetic data generation introduces cost and reliability concerns.

Abstract

Machine learning is increasingly being embedded into government digital platforms, but public-sector constraints make it difficult to build ML systems that are accurate, auditable, and operationally sustainable. In practice, teams face not only technical issues like extreme class imbalance and data drift, but also organizational barriers such as bureaucratic data access, lack of versioned datasets, and incomplete governance over provenance and monitoring. Our study of the Brasil Participativo (BP) platform shows that common engineering choices -- like using LLMs for pre-labeling, splitting models into routed classifiers, and generating synthetic data -- can speed development but also introduce new traceability, reliability, and cost risks if not paired with disciplined data governance and human validation. This means that, in the public sector, responsible ML is not just a modeling…
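To make the traceability concern concrete, here is a minimal sketch of how an LLM pre-label could be stored together with its provenance and gated behind human validation. It is an illustration only, not the BP platform's implementation: the `PreLabel` record, its field names, the `example-llm` identifier, and the helper functions are all hypothetical.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timezone
from hashlib import sha256
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class PreLabel:
    """One LLM-suggested label plus the provenance needed to audit it later.

    All field names are assumptions made for this sketch.
    """
    text: str
    suggested_label: str
    model_name: str                      # hypothetical identifier of the labeling LLM
    prompt_sha256: str                   # hash of the exact prompt used
    created_at: str                      # UTC timestamp in ISO 8601 format
    validated_by: Optional[str] = None   # reviewer id, set only after human review
    final_label: Optional[str] = None    # label confirmed or corrected by the reviewer


def make_prelabel(text: str, suggested_label: str,
                  model_name: str, prompt: str) -> PreLabel:
    """Wrap a raw LLM suggestion in a provenance record."""
    return PreLabel(
        text=text,
        suggested_label=suggested_label,
        model_name=model_name,
        prompt_sha256=sha256(prompt.encode("utf-8")).hexdigest(),
        created_at=datetime.now(timezone.utc).isoformat(),
    )


def validate(record: PreLabel, reviewer: str, final_label: str) -> PreLabel:
    """Return a copy of the record marked as reviewed by a human."""
    return replace(record, validated_by=reviewer, final_label=final_label)


def training_set(records: List[PreLabel]) -> List[Tuple[str, str]]:
    """Keep only human-validated records as (text, final_label) pairs."""
    return [(r.text, r.final_label) for r in records if r.validated_by is not None]


# Usage: only the reviewed record reaches the training set.
a = make_prelabel("Fix the road lighting", "infrastructure",
                  "example-llm", "Classify: {text}")
b = make_prelabel("More school meals", "education",
                  "example-llm", "Classify: {text}")
a_reviewed = validate(a, "reviewer-1", "infrastructure")
print(training_set([a_reviewed, b]))
```

Keeping the record immutable and producing a new copy on validation is one possible design choice: the original LLM suggestion stays intact, so the suggested and final labels can be compared later.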

Taxonomy

Topics: Big Data and Digital Economy · Ethics and Social Impacts of AI · Big Data and Business Intelligence