CentaurEval: Benchmarking Human-in-the-Loop Value in Agentic Coding

Hanjun Luo; Chiming Ni; Jiaheng Wen; Zhimu Huang; Yiran Wang; Bingduo Liao; Sylvia Chung; Yingbin Jin; Xinfeng Li; Wenyuan Xu; XiaoFeng Wang; Hanan Salam

arXiv:2512.04111 · cs.SE · May 22, 2026

TL;DR

CentaurEval introduces a novel benchmark for evaluating human-in-the-loop value in coding, emphasizing collaboration between humans and AI to solve complex tasks that are intractable for either alone.

Contribution

It presents a new ecologically valid benchmark with collaboration-necessary problems, enabling standardized assessment of human-AI teamwork in coding tasks.

Findings

01. Collaboration significantly improves success rates from below 20% to over 30%.

02. Neither humans nor LLMs alone perform well on collaboration-necessary tasks.

03. Emerging co-reasoning partnership challenges traditional human-tool hierarchies.

Abstract

LLM-powered coding agents are reshaping the development paradigm. However, existing evaluation systems, whether traditional tests for humans or benchmarks for LLMs, fail to capture this shift: they exclude problems that require both human reasoning to guide solutions and AI efficiency for implementation. We introduce CentaurEval, a unified, ecologically valid benchmark for measuring human-in-the-loop value in coding. CentaurEval's core innovation is its "Collaboration-Necessary" problem templates, which are intractable for standalone LLMs or humans, but solvable through effective collaboration. CentaurEval dynamically instantiates tasks from 45 templates, providing a standardized IDE for humans and a reproducible 450-task toolkit for LLMs. We benchmark 45 participants against 5 LLMs under 4 levels of human intervention. Results show that while LLMs or humans alone achieve poor pass rates…
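The abstract says tasks are instantiated dynamically from templates yet form a reproducible task set, but it does not say how. One common way to get both properties is seeded instantiation, sketched below. Everything in this sketch is an assumption made for illustration: the `TaskTemplate` class, the slot-filling scheme, and the `instances_per_template` parameter are hypothetical and are not taken from the paper or its toolkit.

```python
import random
from dataclasses import dataclass


@dataclass
class TaskTemplate:
    """Hypothetical template: a prompt with named slots and candidate values."""
    template_id: str
    prompt: str   # e.g. "Implement a {structure} with capacity {n}."
    slots: dict   # slot name -> list of candidate values


def instantiate(template, seed):
    """Fill every slot deterministically from (template_id, seed)."""
    # A string seed gives the same random stream on every run and machine.
    rng = random.Random(f"{template.template_id}:{seed}")
    values = {
        name: rng.choice(options)
        for name, options in sorted(template.slots.items())
    }
    return {
        "task_id": f"{template.template_id}-{seed}",
        "prompt": template.prompt.format(**values),
    }


def build_task_set(templates, instances_per_template):
    """Instantiate each template once per seed in 0..instances_per_template-1."""
    return [
        instantiate(template, seed)
        for template in templates
        for seed in range(instances_per_template)
    ]
```

With 45 templates, `build_task_set(templates, 10)` would return 450 tasks, which matches the counts in the abstract; whether the authors actually use ten instances per template is a guess. Rebuilding with the same arguments yields the same tasks, which is the sense of "reproducible" assumed here.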

Peer Reviews

Decision · ICLR 2026 Conference Desk Rejected Submission

Reviewer 01 · Rating 6 · Confidence 5

Strengths

- I like the idea. It does indeed identify a key gap in existing benchmarks. It also constructs the benchmark in a way that fits well with my own experience of where agentic AI coding systems are most useful.

Weaknesses

- I think my biggest criticism (which is a relatively small one) is that the sample of participants is very biased. This should be mentioned in the main text. I think it would be sufficient to note the biggest biases in the sample in the main text, namely that all the participants identified as East Asian and that all of the participants regularly use AI coding assistants (this would just imply to the reader that some care needs to be taken when drawing conclusions). It should also be noted (tho

Reviewer 02 · Rating 2 · Confidence 4

Strengths

This is a really important direction. Regarding originality, I am familiar with some work evaluating how AI might enhance human performance, but not a standardized benchmark for comparing human-AI teaming with just one of these factors. Trying to construct problems that expose the value of human-AI partnership is a unique approach that I think is conceptually very interesting. Regarding significance, the domain should broadly be of interest to many, as increasingly, human-AI teaming is the norm

Weaknesses

While I think the approach of finding problems that are AI-incomplete but that are amenable to human reliance on AI for parts is a really interesting one, I’m left with a question: is the way that this is done meant to capture the real ways that humans and AI complement each other, and if so, does it accomplish this? The construction of this benchmark seems to rely on some intuitive building blocks for how humans and AI complement each other (e.g. humans providing clarification and decomposition

Reviewer 03 · Rating 6 · Confidence 3

Strengths

I find the user study and interface to be good contributions, but the most relevant contribution is the approach the paper takes to creating tasks that are ecologically relevant but solvable by neither humans nor LLMs alone. For me, this is the main contribution, and the rest is an evaluation of this method of creating tasks.

Weaknesses

Some of the evaluations seem as if they are satisfied by construction. E.g., the fact that SOTA LLMs can't solve the tasks is in the specification of the task creation algorithm.

Taxonomy

Topics: Mobile Crowdsensing and Crowdsourcing · Ethics and Social Impacts of AI · Artificial Intelligence in Healthcare and Education