Beyond Screenshots: Evaluating VLMs' Understanding of UI Animations

Chen Liang; Xirui Jiang; Naihao Deng; Eytan Adar; Anhong Guo

arXiv:2604.26148·cs.HC·April 30, 2026

TL;DR

This paper introduces AniMINT, a dataset of 300 densely annotated UI animation videos, and evaluates how well current Vision Language Models (VLMs) understand UI animations. The models detect primitive motion reliably but are inconsistent at high-level animation interpretation.

Contribution

The paper presents AniMINT, a new dataset for UI animation understanding, and systematically assesses VLMs' capabilities in perceiving and interpreting UI animations.

Findings

01 VLMs reliably detect primitive motion

02 High-level animation interpretation is inconsistent

03 Key bottlenecks identified using MCPC cues

Abstract

AI agents operating on user interfaces must understand how interfaces communicate state and feedback to act reliably. As a core communicative modality, animations are increasingly used in modern interfaces, serving critical functional purposes beyond mere aesthetics. Thus, understanding UI animation is essential for comprehensive interface interpretation. However, recent studies of Vision Language Models (VLMs) for UI understanding have focused primarily on static screenshots, leaving it unclear how well these models handle dynamic UI animations. To address this gap, we created AniMINT, a novel dataset of 300 densely annotated UI animation videos. We systematically evaluate state-of-the-art VLMs on UI animation understanding, including their abilities to perceive the animation effects, identify animation purposes, and interpret animation meaning. Our results show that VLMs can reliably…

Code & Models

Datasets

pubacc/AniMINT
dataset · 473 dl
