Animation Needs Attention: A Holistic Approach to Slides Animation Comprehension with Visual-Language Models

Yifan Jiang; Yibo Xue; Yukun Kang; Pin Zheng; Jian Peng; Feiran Wu; Changliang Xu

arXiv:2507.03916 · cs.AI · July 29, 2025

TL;DR

This paper introduces a new dataset and model fine-tuning approach for improving AI understanding and generation of slide animations, addressing a key gap in visual-language models for dynamic presentation content.

Contribution

The authors release the first public dataset for slide-animation modeling and demonstrate that fine-tuning with Low-Rank Adaptation significantly enhances animation comprehension and generation capabilities of vision-language models.
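For context, the standard formulation of Low-Rank Adaptation keeps the pretrained weight matrix frozen and trains only a low-rank update. The symbols below ($W_0$, $A$, $B$, rank $r$, scaling factor $\alpha$) follow the usual LoRA notation and are not taken from this page; the rank and scaling the authors actually used are not stated here.

```latex
W = W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d \times r},\quad
A \in \mathbb{R}^{r \times k},\quad
r \ll \min(d, k)
```

Only $A$ and $B$ receive gradients, so the number of trainable parameters per adapted matrix drops from $dk$ to $r(d + k)$.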

Findings

01. LoRA fine-tuning improves BLEU-4 by 60% and ROUGE-L by 30%.

02. The CODA metric effectively evaluates animation action coverage, order, and detail.

03. The dataset enables better temporal reasoning in slide animation understanding.
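This page does not give the CODA formulas, so the following is only a rough illustration of what scoring the coverage and order of animation actions could look like. It is not the authors' metric. The function names, the representation of an animation as a list of action names, and the choice of multiset recall and longest common subsequence are all assumptions made for this sketch; the detail component is left out.

```python
from collections import Counter
from typing import List


def coverage_score(reference: List[str], predicted: List[str]) -> float:
    """Hypothetical coverage: share of reference actions found in the prediction."""
    if not reference:
        return 1.0
    ref_counts = Counter(reference)
    pred_counts = Counter(predicted)
    # Count each reference action at most as often as it was predicted.
    matched = sum(min(n, pred_counts[action]) for action, n in ref_counts.items())
    return matched / len(reference)


def order_score(reference: List[str], predicted: List[str]) -> float:
    """Hypothetical order score: LCS length divided by the reference length."""
    if not reference:
        return 1.0
    m, n = len(reference), len(predicted)
    # Classic dynamic-programming table for the longest common subsequence.
    lcs = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if reference[i - 1] == predicted[j - 1]:
                lcs[i][j] = lcs[i - 1][j - 1] + 1
            else:
                lcs[i][j] = max(lcs[i - 1][j], lcs[i][j - 1])
    return lcs[m][n] / m


if __name__ == "__main__":
    reference = ["fade-in", "fly-in", "wipe"]
    predicted = ["fly-in", "fade-in", "wipe"]  # same actions, first two swapped
    print("coverage:", coverage_score(reference, predicted))
    print("order:", order_score(reference, predicted))
```

In this toy example every reference action is present, so coverage is perfect, while the swapped first two actions lower the order score. Separating the two lets a prediction with the right effects in the wrong sequence be told apart from one that omits effects entirely.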

Abstract

Slide animations, such as fade-in, fly-in, and wipe, are critical for audience engagement, efficient information delivery, and vivid visual expression. However, most AI-driven slide-generation tools still lack native animation support, and existing vision-language models (VLMs) struggle with animation tasks due to the absence of public datasets and limited temporal-reasoning capabilities. To address this gap, we release the first public dataset for slide-animation modeling: 12,000 triplets of natural-language descriptions, animation JSON files, and rendered videos, collectively covering every built-in PowerPoint effect. Using this resource, we fine-tune Qwen-2.5-VL-7B with Low-Rank Adaptation (LoRA) and achieve consistent improvements over GPT-4.1 and Gemini-2.5-Pro in BLEU-4, ROUGE-L, SPICE, and our Coverage-Order-Detail Assessment (CODA) metric, which evaluates action coverage,…

Code & Models

Datasets

Jyf9774/PPTAnimation_Test
dataset · 222 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Motion and Animation · Generative Adversarial Networks and Image Synthesis