FieldWorkArena: Agentic AI Benchmark for Real Field Work Tasks
Jun Takahashi, Atsunori Moteki, Akiyoshi Uchida, Shoichi Masui, Fan Yang, Kanji Uchino, Yueqi Song, Yonatan Bisk, Graham Neubig, Ikuo Kusajima, Yasuto Watanabe, Hiroyuki Ishida, Koki Nakagawa, Shan Jiang

TL;DR
FieldWorkArena is a new benchmark for evaluating agentic AI in real-world environments like factories and retail stores, focusing on safety and procedural tasks, with publicly available datasets and evaluation tools.
Contribution
This work introduces a real-world agentic AI benchmark with improved evaluation methods and a comprehensive dataset collected from actual field environments.
Findings
Evaluation confirms the feasibility of performance assessment with multimodal LLMs such as GPT-4o.
The dataset includes images and videos from factories, warehouses, and retail stores.
The study highlights both the effectiveness and limitations of the proposed evaluation methodology.
Abstract
This paper introduces FieldWorkArena, a benchmark for agentic AI targeting real-world field work. As demand for agentic AI grows, such agents are being built to detect and document safety hazards, procedural violations, and other critical incidents across real-world manufacturing and retail environments. Whereas most agentic AI benchmarks focus on performance in simulated or digital environments, our work addresses the fundamental challenge of evaluating agents in the real world. In this paper, we improve the evaluation function used in previous methods to assess the performance of agentic AI in diverse real-world tasks. Our dataset comprises images and videos captured on site in factories, warehouses, and retail stores. Tasks were meticulously developed through interviews with site workers and managers. Evaluation results confirmed that performance evaluation considering the characteristics of…
