Towards Human-Guided, Data-Centric LLM Co-Pilots
Evgeny Saveliev, Jiashuo Liu, Nabeel Seedat, Anders Boyd, Mihaela van der Schaar

TL;DR
This paper introduces CliMB-DC, a human-guided, data-centric framework for LLM co-pilots that combines advanced data tools with reasoning to improve data handling in real-world ML applications.
Contribution
It presents a novel multi-agent reasoning system with human-in-the-loop guidance and formalizes a taxonomy of data-centric challenges for LLM co-pilots.
Findings
Outperforms existing co-pilots on healthcare datasets
Transforms uncurated data into ML-ready formats
Demonstrates effectiveness across multiple domains
Abstract
Machine learning (ML) has the potential to revolutionize various domains, but its adoption is often hindered by the disconnect between the needs of domain experts and the translation of those needs into robust and valid ML tools. Despite recent advances in LLM-based co-pilots that aim to democratize ML for non-technical domain experts, these systems remain predominantly focused on model-centric aspects while overlooking critical data-centric challenges. This limitation is problematic in real-world settings, where raw data often contains complex issues such as missing values, label noise, and domain-specific nuances that require tailored handling. To address this, we introduce CliMB-DC, a human-guided, data-centric framework for LLM co-pilots that combines advanced data-centric tools with LLM-driven reasoning to enable robust, context-aware data processing. At its core, CliMB-DC introduces a novel,…
Taxonomy
Topics: Business Process Modeling and Analysis · Scientific Computing and Data Management · Simulation Techniques and Applications
