LiveClawBench: Benchmarking LLM Agents on Complex, Real-World Assistant Tasks
Xiang Long, Li Du, Yilong Xu, Fangcheng Liu, Haoqing Wang, Ning Ding, Ziheng Li, Jianyuan Guo, and Yehui Tang

TL;DR
LiveClawBench is a new benchmark for evaluating LLM agents on complex, real-world assistant tasks, characterizing task difficulty along three axes: environment complexity, cognitive demand, and runtime adaptability.
Contribution
The paper introduces a Triple-Axis Complexity Framework, derived from an analysis of real OpenClaw usage cases, and a pilot benchmark for assessing LLM agents in realistic, compositional assistant scenarios.
Findings
The pilot benchmark covers real-world assistant tasks annotated with explicit complexity factors.
The framework enables evaluation of LLM agents across multiple difficulty dimensions.
The project page provides ongoing updates and expanded task collections.
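To make the idea of complexity-factor annotations concrete, here is a minimal sketch of how a task annotated along the three axes might be represented and how results could be grouped per axis. This is not the benchmark's actual schema or scoring code: the class names, the factor labels ("multi_app", "underspecified_instruction", "mid_task_change"), and the pass/fail values are all hypothetical and exist only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Set


class Axis(Enum):
    # The three axes are named in the paper; modelling them as an enum is an assumption.
    ENVIRONMENT_COMPLEXITY = "Environment Complexity"
    COGNITIVE_DEMAND = "Cognitive Demand"
    RUNTIME_ADAPTABILITY = "Runtime Adaptability"


@dataclass
class Task:
    task_id: str
    # Hypothetical factor labels per axis; the real annotation vocabulary is not given here.
    factors: Dict[Axis, List[str]] = field(default_factory=dict)
    passed: bool = False  # toy outcome, not a reported result

    def active_axes(self) -> Set[Axis]:
        """Return the axes on which this task carries at least one factor."""
        return {axis for axis, labels in self.factors.items() if labels}


def pass_rate_by_axis(tasks: List[Task]) -> Dict[Axis, float]:
    """Fraction of passed tasks among those annotated on each axis."""
    totals: Dict[Axis, int] = defaultdict(int)
    passes: Dict[Axis, int] = defaultdict(int)
    for task in tasks:
        for axis in task.active_axes():
            totals[axis] += 1
            passes[axis] += int(task.passed)
    return {axis: passes[axis] / totals[axis] for axis in totals}


# Toy data: a compositional task can be annotated on several axes at once.
example_tasks = [
    Task("t1", {Axis.ENVIRONMENT_COMPLEXITY: ["multi_app"]}, passed=True),
    Task(
        "t2",
        {
            Axis.ENVIRONMENT_COMPLEXITY: ["multi_app"],
            Axis.COGNITIVE_DEMAND: ["underspecified_instruction"],
        },
        passed=False,
    ),
    Task("t3", {Axis.RUNTIME_ADAPTABILITY: ["mid_task_change"]}, passed=True),
]

rates = pass_rate_by_axis(example_tasks)
for axis, rate in rates.items():
    print(f"{axis.value}: {rate:.2f}")
```

Because a single task can count toward several axes, the per-axis rates are not independent. That matches the compositional difficulty the framework is meant to capture, where a task can be hard in more than one way at once.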
Abstract
LLM-based agents are increasingly expected to handle real-world assistant tasks, yet existing benchmarks typically evaluate them under isolated sources of difficulty, such as a single environment or fully specified instructions. This leaves a substantial gap between current evaluation settings and the compositional challenges that arise in practical deployment. To address this gap, we introduce LiveClawBench, a benchmark to evaluate LLM agents on real-world assistant tasks. Based on an analysis of various real OpenClaw usage cases, we derive a Triple-Axis Complexity Framework that characterizes task difficulty along three dimensions: Environment Complexity, Cognitive Demand, and Runtime Adaptability. Guided by this framework, we construct a pilot benchmark with explicit complexity-factor annotations, covering real-world assistant tasks with compositional difficulty. Together, the…
