OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation
Zilong Wang, Yuedong Cui, Li Zhong, Zimin Zhang, Da Yin, Bill Yuchen Lin, Jingbo Shang

TL;DR
OfficeBench is a benchmark for evaluating large language model agents on complex, multi-application office tasks that require planning, application switching, and decision-making. It reveals current limitations and areas for improvement.
Contribution
This work presents one of the first benchmarks for realistic office automation tasks, emphasizing multi-application integration and long-horizon planning for LLM agents.
Findings
GPT-4 Omni achieves a 47% pass rate on OfficeBench.
Current models underperform humans in accuracy and reliability.
Main issues include operation redundancy, hallucinations, and application switching limitations.
Abstract
Office automation significantly enhances human productivity by automatically finishing routine tasks in the workflow. Beyond the basic information extraction studied in much of the prior document AI literature, office automation research should be extended to more realistic office tasks that require integrating various information sources in the office system and producing outputs through a series of decision-making processes. We introduce OfficeBench, one of the first office automation benchmarks for evaluating current LLM agents' capability to address office tasks in realistic office workflows. OfficeBench requires LLM agents to perform feasible long-horizon planning, proficiently switch between applications in a timely manner, and accurately ground their actions within a large combined action space, based on the contextual demands of the workflow. Applying our customized…
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Scheduling and Optimization Algorithms · Advanced Manufacturing and Logistics Optimization
Methods: Attention Is All You Need · Linear Layer · Residual Connection · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Softmax · Absolute Position Encodings · Dense Connections
