OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation
Xiaomeng Hu, Yinger Zhang, Fei Huang, Jianhong Tu, Yang Su, Lianghao Deng, Yuxuan Liu, Yantao Liu, Dayiheng Liu, Tsung-Yi Ho

TL;DR
OccuBench is a comprehensive benchmark that evaluates AI agents across 100 real-world professional tasks in various industries using language environment simulators, revealing diverse capabilities and robustness challenges.
Contribution
This work introduces OccuBench, the first large-scale, multi-domain benchmark for assessing AI agents' performance and robustness in professional environments via simulated scenarios.
Findings
No single model excels across all industries, showing diverse occupational profiles.
Implicit faults are more challenging than explicit errors, requiring autonomous detection.
Larger, newer models with more reasoning improve performance significantly.
Abstract
AI agents are expected to perform professional work across hundreds of occupational domains (from emergency department triage to nuclear reactor safety monitoring to customs import processing), yet existing benchmarks can only evaluate agents in the few domains where public environments exist. We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language Environment Simulators (LESs) that simulate domain-specific environments through LLM-driven tool response generation. Our multi-agent synthesis pipeline automatically produces evaluation instances with guaranteed solvability, calibrated difficulty, and document-grounded diversity. OccuBench evaluates agents along two complementary dimensions: task completion across professional domains and environmental robustness under controlled…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
