LLM-AutoDP: Automatic Data Processing via LLM Agents for Model Fine-tuning

Wei Huang; Anda Cheng; Yinggui Wang; Lei Wang; Tao Wei

arXiv:2601.20375·cs.LG·May 8, 2026

TL;DR

LLM-AutoDP is an automated framework in which LLM agents generate and optimize data processing strategies for model fine-tuning without exposing the raw data, improving fine-tuned model quality and shortening the strategy search.

Contribution

The paper presents an LLM-based framework that automates the generation and optimization of data processing strategies without exposing raw data, and introduces acceleration techniques that speed up the strategy search.

Findings

01

Models trained on processed data outperform models trained on unprocessed data, with a win rate above 80%.

02

LLM-AutoDP achieves a win rate of about 65% against AutoML baselines.

03

Acceleration techniques reduce search time by up to a factor of 10.

Abstract

Large Language Models (LLMs) can be fine-tuned on domain-specific data to enhance their performance in specialized fields. However, such data often contains numerous low-quality samples, necessitating effective data processing (DP). In practice, DP strategies are typically developed through iterative manual analysis and trial-and-error adjustment. These processes inevitably incur high labor costs and may lead to privacy issues in high-privacy domains like healthcare due to direct human access to sensitive data. Thus, achieving automated data processing without exposing the raw data has become a critical challenge. To address this challenge, we propose LLM-AutoDP, a novel framework that leverages LLMs as agents to automatically generate and optimize data processing strategies. Our method generates multiple candidate strategies and iteratively refines them using feedback signals and…
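
The generate-and-refine loop described in the abstract can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the operator names, the `propose` stand-in for the LLM agent, and the `toy_evaluate` feedback function are all hypothetical, and the abstract as shown here is truncated, so the actual feedback signals are not specified. The one property the sketch tries to mirror is that the proposer sees only strategies and their scores, never raw samples.

```python
import random
from typing import Callable, List, Tuple

# A strategy is an ordered tuple of operator names (hypothetical names).
Strategy = Tuple[str, ...]
History = List[Tuple[Strategy, float]]
OPERATORS = ("dedup", "length_filter", "quality_filter", "rewrite")


def propose(history: History, n: int, rng: random.Random) -> List[Strategy]:
    """Stand-in for the LLM agent (assumption).

    It receives only (strategy, score) pairs, never raw data samples.
    """
    if not history:
        # First round: sample random operator subsets as initial candidates.
        return [
            tuple(rng.sample(OPERATORS, k=rng.randint(1, len(OPERATORS))))
            for _ in range(n)
        ]
    # Later rounds: mutate the best strategy seen so far.
    best = max(history, key=lambda h: h[1])[0]
    candidates = []
    for _ in range(n):
        ops = list(best)
        op = rng.choice(OPERATORS)
        if op in ops and len(ops) > 1:
            ops.remove(op)   # drop one operator
        elif op not in ops:
            ops.append(op)   # add one operator
        candidates.append(tuple(ops))
    return candidates


def search(
    evaluate: Callable[[Strategy], float],
    rounds: int = 5,
    n_candidates: int = 4,
    seed: int = 0,
) -> Tuple[Strategy, float]:
    """Generate candidates, score them, and refine over several rounds."""
    rng = random.Random(seed)
    history: History = []
    for _ in range(rounds):
        for strategy in propose(history, n_candidates, rng):
            history.append((strategy, evaluate(strategy)))
    return max(history, key=lambda h: h[1])


def toy_evaluate(strategy: Strategy) -> float:
    """Hypothetical feedback signal.

    A real system would score a strategy by processing the data,
    fine-tuning a model, and evaluating it.
    """
    useful = {"dedup", "quality_filter"}
    return sum(1.0 if op in useful else -0.5 for op in strategy)


if __name__ == "__main__":
    best_strategy, best_score = search(toy_evaluate)
    print(best_strategy, best_score)
```

In this sketch, replacing `propose` with a call to an LLM and `toy_evaluate` with a real fine-tune-and-evaluate step would leave the outer loop unchanged, which is why the search is kept separate from both.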
