FormGym: Doing Paperwork with Agents

Matthew Toles; Rattandeep Singh; Isaac Song; Zhou Yu

arXiv:2506.14079 · cs.AI · January 23, 2026

TL;DR

This paper introduces FormGym, a challenging form-filling benchmark for AI agents in the image domain, and proposes FieldFinder to improve localization, significantly boosting model performance.

Contribution

The paper presents a new form-filling benchmark and a localization tool, FieldFinder, enabling AI agents to perform better in complex, image-based form completion tasks.

Findings

01 Baseline VLAs achieve less than 1% accuracy in most cases, primarily due to poor localization ability.

02 GUI agents score between 10.6% and 68.0% despite high cost and latency.

03 FieldFinder improves performance by up to 56% across tasks.

Abstract

Completing paperwork is a challenging and time-consuming problem. Form filling is especially challenging in the pure-image domain without access to OCR, typeset PDF text, or a DOM. For computer agents, it requires multiple abilities, including multi-modal understanding, information retrieval, and tool use. We present a novel form-filling benchmark consisting of 432 fields spread across 55 documents and 3 tasks, requiring knowledge of 236 features per user. We find that baseline VLAs achieve less than 1% accuracy in most cases, primarily due to poor localization ability. GUI agents also struggle, scoring between 10.6% and 68.0% despite high cost and latency. Therefore, we also contribute FieldFinder, a tool to assist LLMs in identifying where to place text on a form. With FieldFinder, all models achieve equal or better performance in all six study conditions, with a maximum increase from 2%…

Taxonomy

Topics: Interactive and Immersive Displays · Modular Robots and Swarm Intelligence