OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation

Zilong Wang; Yuedong Cui; Li Zhong; Zimin Zhang; Da Yin; Bill Yuchen Lin; Jingbo Shang

arXiv:2407.19056·cs.CL·July 30, 2024

TL;DR

OfficeBench introduces a comprehensive benchmark for evaluating large language model agents' ability to perform complex, multi-application office tasks involving planning, switching, and decision-making, revealing current limitations and areas for improvement.

Contribution

This work presents one of the first benchmarks for realistic office automation tasks, emphasizing multi-application integration and long-horizon planning for LLM agents.

Findings

01

GPT-4 Omni achieves a 47% pass rate on OfficeBench.

02

Current models underperform humans in accuracy and reliability.

03

Main issues include operation redundancy, hallucinations, and application switching limitations.

Abstract

Office automation significantly enhances human productivity by automatically completing routine tasks in the workflow. Beyond the basic information extraction studied in much of the prior document AI literature, office automation research should be extended to more realistic office tasks that require integrating various information sources in the office system and producing outputs through a series of decision-making processes. We introduce OfficeBench, one of the first office automation benchmarks for evaluating current LLM agents' capability to address office tasks in realistic office workflows. OfficeBench requires LLM agents to perform feasible long-horizon planning, switch between applications proficiently and in a timely manner, and accurately ground their actions within a large combined action space, based on the contextual demands of the workflow. Applying our customized…

Code & Models

Repositories

zlwang-cs/OfficeBench (Official)

Taxonomy

Topics: Multi-Agent Systems and Negotiation · Scheduling and Optimization Algorithms · Advanced Manufacturing and Logistics Optimization

Methods: Attention Is All You Need · Linear Layer · Residual Connection · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Softmax · Absolute Position Encodings · Dense Connections