Animation Needs Attention: A Holistic Approach to Slides Animation Comprehension with Visual-Language Models
Yifan Jiang, Yibo Xue, Yukun Kang, Pin Zheng, Jian Peng, Feiran Wu, Changliang Xu

TL;DR
This paper introduces a new dataset and model fine-tuning approach for improving AI understanding and generation of slide animations, addressing a key gap in visual-language models for dynamic presentation content.
Contribution
The authors release the first public dataset for slide-animation modeling and demonstrate that fine-tuning with Low-Rank Adaptation significantly enhances animation comprehension and generation capabilities of vision-language models.
Findings
LoRA fine-tuning of Qwen-2.5-VL-7B improves BLEU-4 by 60% and ROUGE-L by 30%.
The Coverage-Order-Detail Assessment (CODA) metric effectively evaluates animation action coverage, order, and detail.
The dataset enables better temporal reasoning in slide animation understanding.
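For readers unfamiliar with Low-Rank Adaptation, the following is a minimal sketch of the general LoRA weight update. It is not the authors' training code. A frozen weight matrix W is combined with a trainable low-rank product, giving W + (alpha / r) * B A. The helper names (`matmul`, `lora_effective_weight`, `trainable_params`) and the toy shapes are assumptions made for illustration only.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A.

    w: frozen weight, shape (d_out, d_in)
    a: trainable down-projection, shape (r, d_in)
    b: trainable up-projection, shape (d_out, r)
    alpha: scaling hyperparameter (illustrative value chosen by the caller)
    """
    r = len(a)                # adapter rank
    scale = alpha / r
    delta = matmul(b, a)      # low-rank update, shape (d_out, d_in)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def trainable_params(d_out, d_in, r):
    """Return (adapter parameter count, full matrix parameter count)."""
    return r * (d_in + d_out), d_out * d_in


# Toy example: a 2x2 frozen weight with a rank-1 adapter.
w = [[1.0, 2.0], [3.0, 4.0]]
a = [[1.0, 0.0]]              # shape (1, 2)
b_zero = [[0.0], [0.0]]       # B starts at zero, so the update is zero
b_trained = [[1.0], [2.0]]    # hypothetical values after some training

start = lora_effective_weight(w, a, b_zero, alpha=2.0)
later = lora_effective_weight(w, a, b_trained, alpha=2.0)
```

Because B is conventionally initialized to zero, the adapted model starts out identical to the base model, and only A and B are updated during fine-tuning. For a hypothetical 4096 x 4096 layer with rank 8, `trainable_params(4096, 4096, 8)` shows the adapter holds 65,536 parameters against 16,777,216 in the full matrix.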
Abstract
Slide animations, such as fade-in, fly-in, and wipe, are critical for audience engagement, efficient information delivery, and vivid visual expression. However, most AI-driven slide-generation tools still lack native animation support, and existing vision-language models (VLMs) struggle with animation tasks due to the absence of public datasets and limited temporal-reasoning capabilities. To address this gap, we release the first public dataset for slide-animation modeling: 12,000 triplets of natural-language descriptions, animation JSON files, and rendered videos, collectively covering every built-in PowerPoint effect. Using this resource, we fine-tune Qwen-2.5-VL-7B with Low-Rank Adaptation (LoRA) and achieve consistent improvements over GPT-4.1 and Gemini-2.5-Pro in BLEU-4, ROUGE-L, SPICE, and our Coverage-Order-Detail Assessment (CODA) metric, which evaluates action coverage,…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Motion and Animation · Generative Adversarial Networks and Image Synthesis
