CLEANER: Self-Purified Trajectories Boost Agentic Reinforcement Learning
Tianshi Xu, Yuteng Chen, Meng Li

TL;DR
CLEANER introduces a self-purification method for agentic RL that uses model-intrinsic self-correction to improve learning efficiency and accuracy, especially in parameter-constrained models, by constructing cleaner training trajectories.
Contribution
The paper presents CLEANER, a novel trajectory purification approach leveraging self-correction to enhance agentic RL training without external filtering or high computational costs.
Findings
Reports accuracy improvements of 6%, 3%, and 5% on its evaluation benchmarks.
Matches state-of-the-art performance with only one-third of training steps.
Demonstrates scalable, efficient trajectory purification for agentic RL.
Abstract
Agentic Reinforcement Learning (RL) has empowered Large Language Models (LLMs) to utilize tools like Python interpreters for complex problem-solving. However, for parameter-constrained models (e.g., 4B–7B), the exploration phase is often plagued by frequent execution failures, creating noisy trajectories that hinder policy optimization. Under standard outcome-based reward settings, this noise leads to a critical credit assignment issue, where erroneous actions are inadvertently reinforced alongside successful outcomes. Existing mitigations face a dilemma: dense rewards often trigger reward hacking, while supersampling incurs prohibitive computational costs. To address these challenges, we propose CLEANER. Distinct from external filtering methods, CLEANER exploits the model's intrinsic self-correction capabilities to eliminate error-contaminated context directly during data collection.…
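The abstract is cut off before it describes the mechanism, so the snippet below is only an illustrative sketch of what "eliminating error-contaminated context" could look like, not the authors' implementation. It assumes a trajectory is a list of tool-use steps, each flagged as failed or successful, and that a failed step can be dropped once the model itself produces a later successful step. The names `Step` and `purify` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One tool-use turn: the code the model emitted and the interpreter's reply."""
    action: str
    observation: str
    failed: bool  # True if execution raised an error (assumed signal)


def purify(trajectory: List[Step]) -> List[Step]:
    """Drop failed steps that are followed by a later successful step.

    Trailing failures with no successful correction are kept, because there
    is nothing to replace them with.
    """
    cleaned: List[Step] = []
    pending_failures: List[Step] = []
    for step in trajectory:
        if step.failed:
            # Hold the failure until we know whether a correction follows.
            pending_failures.append(step)
        else:
            # A successful step supersedes the failures held before it.
            pending_failures.clear()
            cleaned.append(step)
    cleaned.extend(pending_failures)
    return cleaned


raw = [
    Step("print(1/0)", "ZeroDivisionError", failed=True),
    Step("print(1/1)", "1.0", failed=False),
    Step("print(2+2)", "4", failed=False),
    Step("import missing_mod", "ModuleNotFoundError", failed=True),
]
clean = purify(raw)
```

Under these assumptions, the first failed call is removed because a successful step follows it, while the final uncorrected failure stays in the trajectory. With an outcome-based reward, the shorter trajectory no longer credits the erroneous first action.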
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)
