Beyond Screenshots: Evaluating VLMs' Understanding of UI Animations
Chen Liang, Xirui Jiang, Naihao Deng, Eytan Adar, Anhong Guo

TL;DR
This paper introduces AniMINT, a dataset of 300 densely annotated UI animation videos, and evaluates how well current Vision Language Models (VLMs) understand dynamic UI animations. The models reliably detect primitive motion but interpret high-level animation meaning inconsistently.
Contribution
The paper presents AniMINT, a new dataset for UI animation understanding, and systematically assesses VLMs' capabilities in perceiving and interpreting UI animations.
Findings
VLMs reliably detect primitive motion
High-level animation interpretation is inconsistent
Key bottlenecks identified using MCPC cues
Abstract
AI agents operating on user interfaces must understand how interfaces communicate state and feedback to act reliably. As a core communicative modality, animations are increasingly used in modern interfaces, serving critical functional purposes beyond mere aesthetics. Thus, understanding UI animation is essential for comprehensive interface interpretation. However, recent studies of Vision Language Models (VLMs) for UI understanding have focused primarily on static screenshots, leaving it unclear how well these models handle dynamic UI animations. To address this gap, we created AniMINT, a novel dataset of 300 densely annotated UI animation videos. We systematically evaluate state-of-the-art VLMs on UI animation understanding, including their abilities to perceive the animation effects, identify animation purposes, and interpret animation meaning. Our results show that VLMs can reliably…
