Beyond Binary Correctness: Scaling Evaluation of Long-Horizon Agents on Subjective Enterprise Tasks

Abhishek Chandwani; Ishan Gupta

arXiv:2603.22744 · cs.AI · March 25, 2026


Open Access

TL;DR

This paper introduces LH-Bench, a novel evaluation framework for long-horizon, subjective enterprise tasks that incorporates expert rubrics, curated artifacts, and human preferences to reliably assess AI performance beyond binary correctness.

Contribution

The paper presents LH-Bench, a three-pillar evaluation design that enables scalable, reliable assessment of long-horizon, subjective enterprise AI tasks using expert-grounded rubrics, curated ground-truth artifacts, and pairwise human preferences.

Findings

01

Expert-grounded rubrics improve evaluation reliability (kappa=0.60).

02

Human preferences validate rubric-based assessments (p<0.05).

03

Public datasets for Figma-to-code and content tasks are released.
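
The kappa figure in finding 01 is an agreement statistic. As a minimal sketch of how such a number is typically computed, the block below implements Cohen's kappa for two raters, for example an LLM judge and a human expert scoring the same outputs. It assumes the unweighted two-rater form; this page does not say which kappa variant the paper uses, and the function name `cohen_kappa` and the toy labels are hypothetical, not data from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two raters over the same items.

    Assumption: the paper's exact kappa variant is not stated on this page;
    this is the standard two-rater form, shown for illustration only.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")

    n = len(labels_a)

    # Observed agreement: fraction of items where both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies,
    # summed over all labels.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivial.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy, made-up labels (1 = meets rubric criterion, 0 = does not).
judge_labels = [1, 1, 1, 0, 0, 0, 1, 0]
human_labels = [1, 1, 0, 0, 0, 0, 1, 1]
kappa = cohen_kappa(judge_labels, human_labels)
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is usually preferred over a plain percentage when label distributions are skewed.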

Abstract

Large language models excel on objectively verifiable tasks such as math and programming, where evaluation reduces to unit tests or a single correct answer. In contrast, real-world enterprise work is often subjective and context-dependent: success hinges on organizational goals, user intent, and the quality of intermediate artifacts produced across long, multi-tool workflows. We introduce LH-Bench, a three-pillar evaluation design that moves beyond binary correctness to score autonomous, long-horizon execution on subjective enterprise tasks. The pillars are: (i) expert-grounded rubrics that give LLM judges the domain context needed to score subjective work, (ii) curated ground-truth artifacts that enable stepwise reward signals (e.g., chapter-level annotation for content tasks), and (iii) pairwise human preference evaluation for convergent validation. We show that domain-authored…


Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Artificial Intelligence in Healthcare and Education