PPTBench: Towards Holistic Evaluation of Large Language Models for PowerPoint Layout and Design Understanding

Zheng Huang; Xukai Liu; Tianyu Hu; Kai Zhang; Ye Liu

arXiv:2512.02624 · cs.CV · December 3, 2025

TL;DR

PPTBench is a comprehensive benchmark designed to evaluate large language models' ability to understand and generate PowerPoint slide layouts, revealing significant gaps in visual-structural reasoning and coherence.

Contribution

This work introduces PPTBench, a multimodal benchmark of 4,439 samples drawn from 958 PPTX files and spanning four task categories (Detection, Understanding, Modification, and Generation). It assesses LLMs on PowerPoint layout and design understanding, addressing the narrow subtask focus of prior benchmarks.

Findings

01. Models interpret slide content but struggle with spatial arrangements.

02. Current MLLMs have difficulty integrating visual cues with layout structures.

03. Systematic errors like misalignment and overlap are common in generated slides.
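
To make the overlap and misalignment errors in the findings concrete, here is a minimal Python sketch of how such errors could be flagged from shape bounding boxes. It is an illustration only, not the paper's evaluation method: the `Box` type, the function names, the tolerance value, and the sample slide are all hypothetical, and coordinates are assumed to be in inches with the origin at the top-left corner.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned bounding box for one slide shape (inches)."""
    name: str
    left: float
    top: float
    width: float
    height: float

    @property
    def right(self) -> float:
        return self.left + self.width

    @property
    def bottom(self) -> float:
        return self.top + self.height


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two boxes, or 0.0 if they do not intersect."""
    w = min(a.right, b.right) - max(a.left, b.left)
    h = min(a.bottom, b.bottom) - max(a.top, b.top)
    return max(w, 0.0) * max(h, 0.0)


def find_overlaps(boxes):
    """Return (name_a, name_b, area) for every pair of boxes that intersect."""
    result = []
    for a, b in combinations(boxes, 2):
        area = overlap_area(a, b)
        if area > 0:
            result.append((a.name, b.name, area))
    return result


def near_misaligned_lefts(boxes, tol=0.1):
    """Return pairs whose left edges are close (within tol) but not equal.

    The tolerance is an assumed threshold; small nonzero offsets like this
    often read as accidental misalignment rather than deliberate indentation.
    """
    return [
        (a.name, b.name)
        for a, b in combinations(boxes, 2)
        if 0 < abs(a.left - b.left) <= tol
    ]


# Hypothetical slide: the body text box intrudes into the title box and
# sits slightly to the right of the title's left edge.
slide = [
    Box("title", 1.0, 0.5, 8.0, 1.0),
    Box("body", 1.05, 1.25, 8.0, 3.0),
    Box("image", 6.0, 5.0, 3.0, 2.0),
]

overlaps = find_overlaps(slide)
misaligned = near_misaligned_lefts(slide)
```

On this sample slide, `overlaps` contains only the title/body pair and `misaligned` flags the same pair, since their left edges differ by 0.05 inches. A real checker would also need to handle intentional layering (for example, text placed over a background image), which this sketch ignores.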

Abstract

PowerPoint presentations combine rich textual content with structured visual layouts, making them a natural testbed for evaluating the multimodal reasoning and layout understanding abilities of modern MLLMs. However, existing benchmarks focus solely on narrow subtasks while overlooking layout-centric challenges, which are central to real-world slide creation and editing. To bridge this gap, we introduce PPTBench, a comprehensive multimodal benchmark for evaluating LLMs on PowerPoint-related tasks. Leveraging a diverse source of 958 PPTX files, PPTBench evaluates models across four categories with 4,439 samples, including Detection, Understanding, Modification, and Generation. Our experiments reveal a substantial gap between semantic understanding and visual-layout reasoning in current MLLMs: models can interpret slide content but fail to produce coherent spatial arrangements. Ablation…

Taxonomy

Topics: Data Visualization and Analytics · Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications