Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows

Chenxin Li; Zhengyang Tang; Mingxin Huang; Yunlong Lin; Shijue Huang; Shengyuan Liu; Bowen Ye; Rang Li; Lei Li; Benyou Wang; Yixuan Yuan

arXiv:2604.28139 · cs.SE · May 4, 2026

TL;DR

Claw-Eval-Live is a live benchmark for workflow agents. Its task set is refreshed from public workflow-demand signals, and its grading draws on recorded execution evidence (traces, audit logs, service state, and workspace artifacts) rather than on the final response alone.

Contribution

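It introduces a live benchmark that separates a refreshable signal layer from a reproducible, time-stamped release snapshot. Each release is materialized as controlled tasks with fixed fixtures, services, workspaces, and graders, and is paired with structured grading over recorded execution evidence.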

Findings

01. Leading models pass only 66.7% of tasks
02. No model reaches 70% success rate
03. Failures are concentrated in HR, management, and multi-system workflows

Abstract

LLM agents are expected to complete end-to-end units of work across software tools, business services, and local workspaces. Yet many agent benchmarks freeze a curated task set at release time and grade mainly the final response, making it difficult to evaluate agents against evolving workflow demand or verify whether a task was executed. We introduce Claw-Eval-Live, a live benchmark for workflow agents that separates a refreshable signal layer, updated across releases from public workflow-demand signals, from a reproducible, time-stamped release snapshot. Each release is constructed from public workflow-demand signals, with ClawHub Top-500 skills used in the current release, and materialized as controlled tasks with fixed fixtures, services, workspaces, and graders. For grading, Claw-Eval-Live records execution traces, audit logs, service state, and post-run workspace artifacts, using…
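The evidence-based grading idea described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the class names, field names, and the pass criterion are hypothetical and are not taken from the Claw-Eval-Live repository. The abstract is truncated before it says how the recorded evidence is scored, so this sketch simply checks for required audit entries and workspace files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunEvidence:
    """Evidence recorded for one agent run (hypothetical field names)."""
    execution_trace: list[str]            # ordered actions taken by the agent
    audit_log: list[str]                  # service-side audit entries
    service_state: dict[str, str]         # post-run state of the task's services
    workspace_artifacts: dict[str, str]   # file path -> file content


@dataclass(frozen=True)
class Task:
    """A controlled task in a release snapshot (hypothetical schema)."""
    task_id: str
    required_audit_entries: tuple[str, ...] = ()
    required_artifacts: tuple[str, ...] = ()


def grade(task: Task, evidence: RunEvidence) -> bool:
    """Pass only if the recorded evidence shows the work was carried out.

    Assumption: the agent's final response text is not consulted at all.
    """
    audit_ok = all(e in evidence.audit_log for e in task.required_audit_entries)
    files_ok = all(p in evidence.workspace_artifacts for p in task.required_artifacts)
    return audit_ok and files_ok


def pass_rate(results: list[bool]) -> float:
    """Percentage of tasks passed; 0.0 for an empty result list."""
    return 100.0 * sum(results) / len(results) if results else 0.0
```

In this sketch, a run that claims success in its final message but leaves no matching audit entry or artifact is graded as a failure, which is the gap in final-response grading that the abstract points to.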

Code & Models

Repositories

claw-eval-live/Claw-Eval-Live (GitHub)