GUI Knowledge Bench: Revealing the Knowledge Gap of VLMs in GUI Tasks

Chenrui Shi; Zedong Yu; Zhi Gao; Ruining Feng; Enqi Liu; Yuwei Wu; Yunde Jia; Liuyu Xiang; Zhaofeng He; Qing Li

arXiv:2510.26098·cs.AI·February 10, 2026

TL;DR

This paper introduces GUI Knowledge Bench, a benchmark to evaluate vision language models' understanding of GUI-specific knowledge, revealing current limitations and guiding future development of more capable GUI agents.

Contribution

The paper defines a structured GUI knowledge framework, creates a comprehensive benchmark across platforms, and analyzes VLMs' knowledge gaps in GUI tasks.

Findings

01. VLMs understand widget functions but lack system state awareness

02. Current models struggle with GUI interaction conventions

03. GUI knowledge correlates with task success in real-world applications

Abstract

Vision language models (VLMs) have advanced graphical user interface (GUI) task automation but still lag behind humans. We hypothesize this gap stems from missing core GUI knowledge, which existing training schemes (such as supervised fine-tuning and reinforcement learning) alone cannot fully address. By analyzing common failure patterns in GUI task execution, we distill GUI knowledge into three dimensions: (1) interface knowledge about widget functions, layout semantics, and system states; (2) interaction knowledge about GUI interaction types and effects; and (3) procedure knowledge of task objectives and workflow sequences. We further introduce GUI Knowledge Bench, a benchmark with multiple-choice and yes/no questions across six platforms (Web, Android, macOS, Windows, Linux, iOS) and 292 applications. Our evaluation indicates that current VLMs are generally aware of the functions of…
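
To make the benchmark structure described in the abstract concrete, the following is a minimal sketch of how multiple-choice and yes/no items could be grouped by the three knowledge dimensions and scored. The `Item` schema, the `accuracy_by_dimension` helper, and the two sample questions are all hypothetical illustrations; they are not the paper's data format, scoring code, or actual benchmark questions.

```python
from collections import defaultdict
from dataclasses import dataclass

# The three knowledge dimensions named in the abstract.
DIMENSIONS = ("interface", "interaction", "procedure")


@dataclass(frozen=True)
class Item:
    """One benchmark question (hypothetical schema, not the paper's format)."""
    dimension: str   # one of DIMENSIONS
    platform: str    # e.g. "Web", "Android"
    question: str
    choices: tuple   # ("Yes", "No") for yes/no items
    answer: str      # the correct choice


def accuracy_by_dimension(items, predictions):
    """Return {dimension: fraction correct} for parallel items and predictions."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, pred in zip(items, predictions):
        if item.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {item.dimension}")
        total[item.dimension] += 1
        # Case-insensitive exact match against the reference answer.
        if pred.strip().lower() == item.answer.strip().lower():
            correct[item.dimension] += 1
    return {dim: correct[dim] / total[dim] for dim in total}


# Made-up sample items, for illustration only.
example_items = [
    Item("interface", "Web",
         "Does a greyed-out button accept clicks?", ("Yes", "No"), "No"),
    Item("interaction", "Android",
         "Which gesture typically opens a context menu?",
         ("Tap", "Long press", "Swipe"), "Long press"),
]
example_predictions = ["no", "Tap"]
scores = accuracy_by_dimension(example_items, example_predictions)
```

Reporting accuracy per dimension rather than as a single number is one way to surface the kind of gap the findings describe, where a model may do well on one type of knowledge and poorly on another.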
