GUI Action Narrator: Where and When Did That Action Take Place?
Qinchen Wu, Difei Gao, Kevin Qinghong Lin, Zhuoyu Wu, Xiangwu Guo, Peiran Li, Weichen Zhang, Hengxu Wang, Mike Zheng Shou

TL;DR
This paper introduces a new GUI action captioning benchmark and a framework that uses cursor prompts to improve understanding of GUI videos, addressing challenges like dense information and rapid events.
Contribution
It presents the Act2Cap dataset and GUI Narrator framework, advancing GUI video captioning with multimodal LLMs by leveraging cursor prompts for better interpretation.
Findings
The dataset contains 4,189 diverse GUI video captioning samples.
The GUI Narrator framework improves captioning performance using cursor-based prompts.
Even advanced models like GPT-4o find GUI captioning highly challenging.
Abstract
The advent of Multimodal LLMs has significantly enhanced image OCR recognition capabilities, making GUI automation a viable reality for increasing efficiency in digital tasks. One fundamental aspect of developing a GUI automation system is understanding primitive GUI actions. This comprehension is crucial as it enables agents to learn from user demonstrations, an essential element of automation. To rigorously evaluate such capabilities, we developed a video captioning benchmark for GUI actions, comprising 4,189 diverse video captioning samples. This task presents unique challenges compared to natural scene video captioning: 1) GUI screenshots typically contain denser information than natural scenes, and 2) events within GUIs are subtler and occur more rapidly, requiring precise attention to the appropriate time span and spatial region for accurate understanding. To address these…
Taxonomy
Topics: Persona Design and Applications
Methods: Softmax · Attention Is All You Need
