GUI-World: A Video Benchmark and Dataset for Multimodal GUI-oriented Understanding

Dongping Chen; Yue Huang; Siyuan Wu; Jingyu Tang; Liuyi Chen; Yilin Bai; Zhigang He; Chenlong Wang; Huichi Zhou; Yiqiang Li; Tianshuo Zhou; Yue Yu; Chujie Gao; Qihui Zhang; Yi Gui; Zhen Li; Yao Wan; Pan Zhou; Jianfeng Gao; Lichao Sun

arXiv:2406.10819 · cs.CV · March 25, 2025 · 1 citation

TL;DR

This paper introduces GUI-World, a comprehensive dataset for evaluating multimodal large language models' ability to understand dynamic, multi-step GUI content across various scenarios, highlighting current limitations and proposing a fine-tuned Video LLM as a potential solution.

Contribution

The paper presents GUI-World, a new dataset with annotations covering diverse GUI scenarios and questions, and evaluates state-of-the-art models, revealing challenges and proposing a fine-tuned Video LLM for GUI understanding.

Findings

01. Current models struggle with dynamic GUI content without manual annotations.

02. Video LLMs underperform on GUI tasks due to dataset sparsity.

03. Fine-tuned Video LLM shows improved GUI understanding.

Abstract

Recently, Multimodal Large Language Models (MLLMs) have been used as agents to control keyboard and mouse inputs by directly perceiving the Graphical User Interface (GUI) and generating corresponding commands. However, current agents primarily demonstrate strong understanding capabilities in static environments and are mainly applied to relatively simple domains, such as Web or mobile interfaces. We argue that a robust GUI agent should be capable of perceiving temporal information on the GUI, including dynamic Web content and multi-step tasks. Additionally, it should possess a comprehensive understanding of various GUI scenarios, including desktop software and multi-window interactions. To this end, this paper introduces a new dataset, termed GUI-World, which features meticulously crafted Human-MLLM annotations, extensively covering six GUI scenarios and eight types of GUI-oriented…

Code & Models

Repositories

keplerlab/katna
none · Official

Models

🤗 ONE-Lab/GUI-Vid
model · ♡ 5

Datasets

ONE-Lab/GUI-World
dataset · 6.6k downloads

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems · Multi-Agent Systems and Negotiation

Methods: Balanced Selection