Towards Human-Guided, Data-Centric LLM Co-Pilots

Evgeny Saveliev; Jiashuo Liu; Nabeel Seedat; Anders Boyd; Mihaela van der Schaar

arXiv:2501.10321 · cs.LG · December 22, 2025

Open Access

TL;DR

This paper introduces CliMB-DC, a human-guided, data-centric framework for LLM co-pilots that combines advanced data tools with reasoning to improve data handling in real-world ML applications.

Contribution

It presents a novel multi-agent reasoning system with human-in-the-loop guidance and formalizes a taxonomy of data-centric challenges for LLM co-pilots.

Findings

01. Outperforms existing co-pilots on healthcare datasets
02. Transforms uncurated data into ML-ready formats
03. Demonstrates effectiveness across multiple domains

Abstract

Machine learning (ML) has the potential to revolutionize various domains, but its adoption is often hindered by the disconnect between the needs of domain experts and the translation of those needs into robust and valid ML tools. Despite recent advances in LLM-based co-pilots that aim to democratize ML for non-technical domain experts, these systems remain predominantly focused on model-centric aspects while overlooking critical data-centric challenges. This limitation is problematic in real-world settings, where raw data often contains complex issues such as missing values, label noise, and domain-specific nuances requiring tailored handling. To address this, we introduce CliMB-DC, a human-guided, data-centric framework for LLM co-pilots that combines advanced data-centric tools with LLM-driven reasoning to enable robust, context-aware data processing. At its core, CliMB-DC introduces a novel,…

Taxonomy

Topics: Business Process Modeling and Analysis · Scientific Computing and Data Management · Simulation Techniques and Applications