UpBench: A Dynamically Evolving Real-World Labor-Market Agentic Benchmark Framework Built for Human-Centric AI

Darvin Yi; Teng Liu; Mattie Terzolo; Lance Hasson; Ayan Sinha; Pablo Mendes; Andrew Rabinovich

arXiv:2511.12306 · cs.AI · December 15, 2025

TL;DR

UpBench is a dynamic, real-world labor-market benchmark framework for evaluating AI agents' competence, adaptability, and collaboration skills using genuine work tasks from the Upwork platform, with expert human evaluation.

Contribution

UpBench introduces an evolving benchmark built from authentic Upwork jobs, with expert rubric-based assessments that support detailed analysis of AI performance in real-world contexts.

Findings

01. Provides a scalable, human-centered evaluation framework

02. Enables fine-grained analysis of AI strengths and weaknesses

03. Supports research on human-AI collaboration

Abstract

As large language model (LLM) agents increasingly undertake digital work, reliable frameworks are needed to evaluate their real-world competence, adaptability, and capacity for human collaboration. Existing benchmarks remain largely static, synthetic, or domain-limited, providing limited insight into how agents perform in dynamic, economically meaningful environments. We introduce UpBench, a dynamically evolving benchmark grounded in real jobs drawn from the global Upwork labor marketplace. Each task corresponds to a verified client transaction, anchoring evaluation in genuine work activity and financial outcomes. UpBench employs a rubric-based evaluation framework, in which expert freelancers decompose each job into detailed, verifiable acceptance criteria and assess AI submissions with per-criterion feedback. This structure enables fine-grained analysis of model strengths, weaknesses,…
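
The rubric-based evaluation described in the abstract can be pictured with a small data model. The following is a minimal sketch only: the class names, fields, and the pass-rate aggregate are illustrative assumptions and are not taken from the paper, which may record and score criteria differently.

```python
from dataclasses import dataclass, field


@dataclass
class Criterion:
    """One acceptance criterion for a job (hypothetical schema, not the paper's)."""
    description: str
    passed: bool
    feedback: str = ""


@dataclass
class RubricResult:
    """Per-criterion assessment of a single AI submission for a single job."""
    job_id: str
    criteria: list[Criterion] = field(default_factory=list)

    def pass_rate(self) -> float:
        """Fraction of criteria met. This aggregate is an assumption for illustration."""
        if not self.criteria:
            return 0.0
        return sum(c.passed for c in self.criteria) / len(self.criteria)

    def failed_feedback(self) -> list[str]:
        """Collect the evaluator's feedback for every criterion that was not met."""
        return [c.feedback for c in self.criteria if not c.passed]


# Made-up example data; the job and criteria below are invented for the sketch.
example = RubricResult(
    job_id="job-001",
    criteria=[
        Criterion("Delivers the file in the requested format", True),
        Criterion("Covers every section listed in the brief", True),
        Criterion("Stays within the stated word limit", False, "Exceeds the limit."),
        Criterion("Contains no factual errors", True),
    ],
)
```

Keeping each criterion as its own record, rather than storing a single score, is what makes the per-criterion feedback available for the kind of fine-grained analysis the abstract mentions.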

Taxonomy

Topics: Digital Economy and Work Transformation · Ethics and Social Impacts of AI · Mobile Crowdsensing and Crowdsourcing