GUI-World: A Video Benchmark and Dataset for Multimodal GUI-oriented Understanding
Dongping Chen, Yue Huang, Siyuan Wu, Jingyu Tang, Liuyi Chen, Yilin Bai, Zhigang He, Chenlong Wang, Huichi Zhou, Yiqiang Li, Tianshuo Zhou, Yue Yu, Chujie Gao, Qihui Zhang, Yi Gui, Zhen Li, Yao Wan, Pan Zhou, Jianfeng Gao, Lichao Sun

TL;DR
This paper introduces GUI-World, a comprehensive dataset for evaluating multimodal large language models' ability to understand dynamic, multi-step GUI content across various scenarios, highlighting current limitations and proposing a fine-tuned Video LLM as a potential solution.
Contribution
The paper presents GUI-World, a new dataset with annotations covering diverse GUI scenarios and questions, and evaluates state-of-the-art models, revealing challenges and proposing a fine-tuned Video LLM for GUI understanding.
Findings
Current models struggle with dynamic GUI content without manual annotations.
Video LLMs underperform on GUI tasks because GUI-oriented video data is sparse.
A fine-tuned Video LLM shows improved GUI understanding.
Abstract
Recently, Multimodal Large Language Models (MLLMs) have been used as agents to control keyboard and mouse inputs by directly perceiving the Graphical User Interface (GUI) and generating corresponding commands. However, current agents primarily demonstrate strong understanding capabilities in static environments and are mainly applied to relatively simple domains, such as Web or mobile interfaces. We argue that a robust GUI agent should be capable of perceiving temporal information on the GUI, including dynamic Web content and multi-step tasks. Additionally, it should possess a comprehensive understanding of various GUI scenarios, including desktop software and multi-window interactions. To this end, this paper introduces a new dataset, termed GUI-World, which features meticulously crafted Human-MLLM annotations, extensively covering six GUI scenarios and eight types of GUI-oriented…
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems · Multi-Agent Systems and Negotiation
Methods: Balanced Selection
