SOP-Bench: Complex Industrial SOPs for Evaluating LLM Agents

Subhrangshu Nandi; Arghya Datta; Rohith Nama; Udita Patel; Nikhil Vichare; Indranil Bhattacharya; Prince Grover; Shivam Asija; Giuseppe Carenini; Wei Zhang; Arushi Gupta; Sreyoshi Bhaduri; Jing Xu; Huzefa Raja; Shayan Ray; Aaron Chan; Esther Xu Fei; Gaoyuan Du; Zuhaib Akhtar; Harshita Asnani; Weian Chan; Ming Xiong; Francesco Carbone; Jeetu Mirchandani

arXiv:2506.08119 · cs.AI · February 24, 2026

Open Access

TL;DR

SOP-Bench is a benchmark of over 2,000 tasks drawn from expert-authored SOPs across 12 business domains, designed to evaluate and improve LLM-based agents' ability to execute complex, multi-step procedures through human-validated, executable workflows.

Contribution

The paper introduces SOP-Bench, a novel benchmark with authentic, expert-crafted SOP tasks for evaluating LLM agents' procedural reasoning and tool orchestration capabilities.

Findings

01

Newer models do not always perform better; e.g., Claude 4 outperforms Claude 4.5 on ReAct tasks.

02

Performance varies significantly across domains and model-agent combinations.

03

SOP-Bench enables systematic analysis of agent performance without costly real-world deployment.
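
The kind of per-domain, per-configuration comparison described in the findings can be tallied along the following lines. This is a minimal sketch, not the paper's evaluation code: the record fields (`domain`, `config`, `predicted`, `expected`), the function name `success_rates`, the exact-match scoring rule, and all sample values are assumptions made for illustration.

```python
from collections import defaultdict


def success_rates(records):
    """Return the fraction of exactly matched tasks per (domain, config) pair.

    `records` is an iterable of dicts with the hypothetical keys
    'domain', 'config', 'predicted', and 'expected'.
    Exact match against the ground truth is an assumed scoring rule.
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for record in records:
        key = (record["domain"], record["config"])
        totals[key] += 1
        if record["predicted"] == record["expected"]:
            passed[key] += 1
    return {key: passed[key] / totals[key] for key in totals}


# Made-up records for illustration only; these are not benchmark results.
records = [
    {"domain": "domain_x", "config": "model_a+react",
     "predicted": "approve", "expected": "approve"},
    {"domain": "domain_x", "config": "model_a+react",
     "predicted": "reject", "expected": "approve"},
    {"domain": "domain_y", "config": "model_a+react",
     "predicted": "escalate", "expected": "escalate"},
    {"domain": "domain_y", "config": "model_b+react",
     "predicted": "approve", "expected": "escalate"},
]

rates = success_rates(records)
for (domain, config), rate in sorted(rates.items()):
    print(f"{domain:10s} {config:15s} {rate:.2f}")
```

Grouping by the (domain, config) pair keeps the two sources of variation separate, so a drop in one domain is not masked by an average over all of them.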

Abstract

LLM-based agents struggle to execute complex, multi-step Standard Operating Procedures (SOPs) that are fundamental to industrial automation. Existing benchmarks fail to capture the procedural complexity and tool orchestration demands of real-world workflows. We introduce SOP-Bench, a benchmark of 2,000+ tasks from human expert-authored SOPs across 12 business domains (healthcare, logistics, finance, content moderation, etc.). Using a human-AI collaborative framework, experts crafted authentic SOPs while AI generated artifacts (tools, APIs, datasets), all human-validated, yielding realistic tasks with executable interfaces and ground-truth outputs. SOP-Bench serves as a research enabler for systematically investigating agent architectures, model capabilities, and deployment considerations across diverse procedural tasks. We demonstrate its utility through illustrative experiments with…

Taxonomy

Topics: Business Process Modeling and Analysis · Multi-Agent Systems and Negotiation · Human-Automation Interaction and Safety