OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation

Xiaomeng Hu; Yinger Zhang; Fei Huang; Jianhong Tu; Yang Su; Lianghao Deng; Yuxuan Liu; Yantao Liu; Dayiheng Liu; Tsung-Yi Ho

arXiv:2604.10866·cs.CL·April 17, 2026

TL;DR

OccuBench is a comprehensive benchmark that evaluates AI agents across 100 real-world professional tasks in various industries using language environment simulators, revealing diverse capabilities and robustness challenges.

Contribution

This work introduces OccuBench, the first large-scale, multi-domain benchmark for assessing AI agents' performance and robustness in professional environments via simulated scenarios.

Findings

01. No single model excels across all industries; each model shows a distinct occupational profile.

02. Implicit faults are harder than explicit errors because the agent must detect them on its own.

03. Larger, newer models and more reasoning significantly improve performance.
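The contrast in finding 02 can be made concrete with a small sketch. The listing does not say what the injected faults look like, so everything here is an assumption for illustration: the function names, the error payload, and the choice of "silently dropping a field" as the implicit fault are hypothetical, not taken from OccuBench.

```python
import copy


def inject_explicit_fault(response: dict) -> dict:
    """Hypothetical explicit fault: the tool announces that it failed."""
    return {"error": "SERVICE_UNAVAILABLE", "detail": "tool call failed"}


def inject_implicit_fault(response: dict, drop_key: str) -> dict:
    """Hypothetical implicit fault: a field silently disappears, no error is raised."""
    degraded = copy.deepcopy(response)
    degraded.pop(drop_key, None)
    return degraded


def looks_faulty(response: dict, required_keys: set) -> bool:
    """A check an agent might run on its own before trusting a tool response."""
    if "error" in response:
        return True  # explicit faults are visible in the payload itself
    # implicit faults only show up when the payload is compared to expectations
    return not required_keys.issubset(response)


# Illustrative tool response; the fields are invented for this example.
clean = {"patient_id": "P-7", "heart_rate": 118, "triage_level": 2}
required = {"patient_id", "heart_rate", "triage_level"}

explicit = inject_explicit_fault(clean)
implicit = inject_implicit_fault(clean, "heart_rate")
```

In this sketch the explicit fault carries an `error` key the agent cannot miss, while the implicit fault looks like a normal response and is caught only if the agent checks the payload against what it expected to receive.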

Abstract

AI agents are expected to perform professional work across hundreds of occupational domains (from emergency department triage to nuclear reactor safety monitoring to customs import processing), yet existing benchmarks can only evaluate agents in the few domains where public environments exist. We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language Environment Simulators (LESs) that simulate domain-specific environments through LLM-driven tool response generation. Our multi-agent synthesis pipeline automatically produces evaluation instances with guaranteed solvability, calibrated difficulty, and document-grounded diversity. OccuBench evaluates agents along two complementary dimensions: task completion across professional domains and environmental robustness under controlled…
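The idea of a Language Environment Simulator, an environment whose tool responses are generated by an LLM rather than by a real backend, can be sketched as follows. This is a minimal illustration under stated assumptions, not the OccuBench implementation: the class, its fields, the prompt layout, and the tool names are hypothetical, and `stub_llm` is a deterministic stand-in for a real model call.

```python
import json
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class LanguageEnvironmentSimulator:
    """Hypothetical LES: answers tool calls by asking a text generator."""
    domain: str
    generate: Callable[[str], str]  # assumed interface: prompt in, JSON text out
    history: list = field(default_factory=list)

    def call_tool(self, name: str, args: dict) -> dict:
        # The prompt carries the domain and the interaction so far, so the
        # generator can keep the simulated environment state consistent.
        prompt = json.dumps(
            {"domain": self.domain, "history": self.history,
             "tool": name, "args": args},
            sort_keys=True,
        )
        response = json.loads(self.generate(prompt))
        self.history.append({"tool": name, "args": args, "response": response})
        return response


def stub_llm(prompt: str) -> str:
    """Deterministic stand-in for an LLM so the sketch runs offline."""
    request = json.loads(prompt)
    return json.dumps({
        "status": "ok",
        "tool": request["tool"],
        "turn": len(request["history"]) + 1,
    })


les = LanguageEnvironmentSimulator(domain="customs import processing",
                                   generate=stub_llm)
first = les.call_tool("lookup_tariff", {"hs_code": "8471.30"})
second = les.call_tool("file_declaration", {"shipment_id": "S-1"})
```

Swapping `stub_llm` for a real model call is the only change needed to move from this toy to an LLM-backed simulator; the agent under evaluation sees the same `call_tool` interface either way.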

Code & Models

Repositories

gregxmhu/OccuBench (GitHub)

Datasets

gregH/OccuBench (dataset, 244 downloads)
