GUI Action Narrator: Where and When Did That Action Take Place?

Qinchen Wu; Difei Gao; Kevin Qinghong Lin; Zhuoyu Wu; Xiangwu Guo; Peiran Li; Weichen Zhang; Hengxu Wang; Mike Zheng Shou

arXiv:2406.13719·cs.CV·June 21, 2024

Open Access

TL;DR

This paper introduces a new GUI action captioning benchmark and a framework that uses cursor prompts to improve understanding of GUI videos, addressing challenges like dense information and rapid events.

Contribution

It presents the Act2Cap dataset and GUI Narrator framework, advancing GUI video captioning with multimodal LLMs by leveraging cursor prompts for better interpretation.

Findings

1. The dataset contains 4,189 diverse GUI video captioning samples.

2. The GUI Narrator framework improves captioning performance using cursor-based prompts.

3. Even advanced models like GPT-4o find GUI captioning highly challenging.

Abstract

The advent of Multimodal LLMs has significantly enhanced image OCR recognition capabilities, making GUI automation a viable reality for increasing efficiency in digital tasks. One fundamental aspect of developing a GUI automation system is understanding primitive GUI actions. This comprehension is crucial as it enables agents to learn from user demonstrations, an essential element of automation. To rigorously evaluate such capabilities, we developed a video captioning benchmark for GUI actions, comprising 4,189 diverse video captioning samples. This task presents unique challenges compared to natural scene video captioning: 1) GUI screenshots typically contain denser information than natural scenes, and 2) events within GUIs are subtler and occur more rapidly, requiring precise attention to the appropriate time span and spatial region for accurate understanding. To address these…

Taxonomy

Topics: Persona Design and Applications

Methods: Softmax · Attention Is All You Need