LiveClawBench: Benchmarking LLM Agents on Complex, Real-World Assistant Tasks

Xiang Long, Li Du, Yilong Xu, Fangcheng Liu, Haoqing Wang, Ning Ding, Ziheng Li, Jianyuan Guo, and Yehui Tang

arXiv:2604.13072·cs.CL·April 16, 2026

TL;DR

LiveClawBench is a benchmark for evaluating LLM agents on complex, real-world assistant tasks. It characterizes task difficulty along three axes: environment complexity, cognitive demand, and runtime adaptability.

Contribution

It introduces a Triple-Axis Complexity Framework and a pilot benchmark to assess LLM agents in realistic, compositional assistant scenarios.

Findings

01 Benchmark covers real-world tasks with annotated complexity factors.

02 Framework enables evaluation of LLM agents across multiple difficulty dimensions.

03 Project page provides ongoing updates and expanded task collections.

Abstract

LLM-based agents are increasingly expected to handle real-world assistant tasks, yet existing benchmarks typically evaluate them under isolated sources of difficulty, such as a single environment or fully specified instructions. This leaves a substantial gap between current evaluation settings and the compositional challenges that arise in practical deployment. To address this gap, we introduce LiveClawBench, a benchmark to evaluate LLM agents on real-world assistant tasks. Based on an analysis of various real OpenClaw usage cases, we derive a Triple-Axis Complexity Framework that characterizes task difficulty along three dimensions: Environment Complexity, Cognitive Demand, and Runtime Adaptability. Guided by this framework, we construct a pilot benchmark with explicit complexity-factor annotations, covering real-world assistant tasks with compositional difficulty. Together, the…
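
As a rough illustration of what "explicit complexity-factor annotations" along the three axes could look like in practice, here is a minimal sketch. It is not the benchmark's actual schema: the `Task` class, the `factors` field, the task ID, and the factor names are all hypothetical, chosen only to mirror the three dimensions named in the abstract.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Set


class Axis(Enum):
    # The three dimensions named in the abstract.
    ENVIRONMENT_COMPLEXITY = "Environment Complexity"
    COGNITIVE_DEMAND = "Cognitive Demand"
    RUNTIME_ADAPTABILITY = "Runtime Adaptability"


@dataclass
class Task:
    """Hypothetical task record; not the benchmark's real format."""
    task_id: str
    # Maps each axis to the complexity factors annotated on this task.
    factors: Dict[Axis, List[str]] = field(default_factory=dict)

    def active_axes(self) -> Set[Axis]:
        # An axis counts as active if at least one factor is annotated on it.
        return {axis for axis, names in self.factors.items() if names}

    def is_compositional(self) -> bool:
        # Assumption: "compositional" here means difficulty on two or more axes.
        return len(self.active_axes()) >= 2


# Placeholder factor names, invented for illustration only.
example = Task(
    task_id="demo-001",
    factors={
        Axis.ENVIRONMENT_COMPLEXITY: ["multiple_environments"],
        Axis.COGNITIVE_DEMAND: ["underspecified_instructions"],
    },
)
```

Under this sketch, `example.is_compositional()` is true because the task carries factors on two axes, whereas a task annotated on a single axis would correspond to the isolated-difficulty setting the abstract attributes to existing benchmarks.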

Code & Models

Repositories

Mosi-AI/LiveClawBench
github

Datasets

Mosi-AI/LiveClawbench-trajectories
dataset · 208 downloads
