TL;DR
ClawGym introduces a comprehensive, scalable framework for developing, training, and evaluating Claw-style personal agents using synthesized datasets, hybrid verification, and benchmark resources.
Contribution
ClawGym provides a systematic platform for synthesizing training data, training models, and benchmarking Claw-style agents, addressing scalability and verifiability challenges.
Findings
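Constructed ClawGym-SynData, a dataset of 13.5K filtered tasks synthesized from persona-driven intents and skill-grounded operations.
Trained ClawGym-Agents via supervised fine-tuning on black-box rollout trajectories, and explored reinforcement learning.
Developed ClawGym-Bench, a benchmark of 200 instances for evaluation.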
Abstract
Claw-style environments support multi-step workflows over local files, tools, and persistent workspace states. However, scalable development around these environments remains constrained by the absence of a systematic framework, especially one for synthesizing verifiable training data and integrating it with agent training and diagnostic evaluation. To address this challenge, we present ClawGym, a scalable framework that supports the full lifecycle of Claw-style personal agent development. Concretely, we construct ClawGym-SynData, a diverse dataset of 13.5K filtered tasks synthesized from persona-driven intents and skill-grounded operations, paired with realistic mock workspaces and hybrid verification mechanisms. We then train a family of capable Claw-style models, termed ClawGym-Agents, through supervised fine-tuning on black-box rollout trajectories, and further explore reinforcement…
