PPTArena: A Benchmark for Agentic PowerPoint Editing
Michael Ofengenden, Yunze Man, Ziqi Pang, Yu-Xiong Wang

TL;DR
PPTArena introduces a comprehensive benchmark for evaluating PowerPoint editing agents on real slide decks, emphasizing in-place modifications guided by natural language, and proposes a structure-aware agent that significantly outperforms existing systems.
Contribution
The paper presents PPTArena, a new benchmark for PowerPoint editing, and introduces PPTPilot, a novel structure-aware editing agent that improves editing accuracy and visual fidelity.
Findings
PPTPilot outperforms proprietary agents by over 10 percentage points.
The benchmark reveals that current agents struggle with long-horizon, document-scale tasks.
PPTArena enables evaluation of in-place slide editing under natural language instructions.
Abstract
We introduce PPTArena, a benchmark for PowerPoint editing that measures reliable modifications to real slides under natural-language instructions. In contrast to image-PDF renderings or text-to-slide generation, PPTArena focuses on in-place editing across 100 decks, 2125 slides, and over 800 targeted edits covering text, charts, tables, animations, and master-level styles. Each case includes a ground-truth deck, a fully specified target outcome, and a dual VLM-as-judge pipeline that separately scores instruction following and visual quality using both structural diffs and slide images. Building on this setting, we propose PPTPilot, a structure-aware slide-editing agent that plans semantic edit sequences, routes between high-level programmatic tools and deterministic XML operations for precise control, and verifies outputs through an iterative plan-edit-check loop against task-specific…
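The plan-edit-check loop and the routing between high-level tools and XML operations described in the abstract can be illustrated with a minimal sketch. This is not PPTPilot's implementation: the deck is a toy dictionary rather than a real .pptx file, and every name here (`Edit`, `TOOL_FIELDS`, `apply_with_tool`, `apply_with_xml`, `route`, `check`, `plan_edit_check`) is hypothetical. It assumes each task constraint can be written as "this field on this slide must equal this value".

```python
from dataclasses import dataclass
from typing import Callable

# Toy stand-in for a deck: slide index -> {field name -> value}.
# A real agent would operate on the .pptx package instead (assumption).
Deck = dict[int, dict[str, str]]

@dataclass(frozen=True)
class Edit:
    """One targeted change; also used as a constraint to verify against."""
    slide: int
    field: str
    value: str

# Hypothetical: fields that the high-level programmatic path can handle.
TOOL_FIELDS = {"title", "body"}

def apply_with_tool(deck: Deck, edit: Edit) -> None:
    """Stand-in for a high-level programmatic edit."""
    deck.setdefault(edit.slide, {})[edit.field] = edit.value

def apply_with_xml(deck: Deck, edit: Edit) -> None:
    """Stand-in for a deterministic low-level XML write."""
    deck.setdefault(edit.slide, {})[edit.field] = edit.value

def route(edit: Edit) -> Callable[[Deck, Edit], None]:
    """Pick the high-level path when it supports the field, else the XML path."""
    return apply_with_tool if edit.field in TOOL_FIELDS else apply_with_xml

def check(deck: Deck, constraints: list[Edit]) -> list[Edit]:
    """Return the constraints the deck does not yet satisfy."""
    return [c for c in constraints
            if deck.get(c.slide, {}).get(c.field) != c.value]

def plan_edit_check(deck: Deck, constraints: list[Edit],
                    max_rounds: int = 3) -> tuple[bool, int]:
    """Repeat plan -> edit -> check until all constraints hold or rounds run out.

    Returns (success, number of editing rounds performed).
    """
    for rounds_done in range(max_rounds):
        pending = check(deck, constraints)   # plan: unmet constraints become edits
        if not pending:
            return True, rounds_done
        for edit in pending:                 # edit: apply via the routed path
            route(edit)(deck, edit)
    return not check(deck, constraints), max_rounds

deck: Deck = {0: {"title": "Old title"}}
constraints = [Edit(0, "title", "New title"), Edit(1, "theme_color", "#003366")]
ok, rounds = plan_edit_check(deck, constraints)
```

In this toy setting both paths perform the same dictionary write, so the sketch shows only the control flow: which path is chosen, and how verification against the constraints decides whether another round is needed.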
Taxonomy
Topics: Computer Graphics and Visualization Techniques · Multimodal Machine Learning Applications · Data Visualization and Analytics
