VLM-SlideEval: Evaluating VLMs on Structured Comprehension and Perturbation Sensitivity in PPT

Hyeonsu Kang; Emily Bao; Anjan Goswami

arXiv:2510.22045·cs.CV·October 28, 2025

TL;DR

VLM-SlideEval is an evaluation framework that tests how well vision-language models extract slide elements, withstand controlled perturbations, and recover a deck's narrative order. It finds that current models underperform on pixel-accurate extraction.

Contribution

The paper introduces VLM-SlideEval, an evaluation framework for slide understanding and robustness testing of VLMs, with ground-truth element metadata standardized from PowerPoint XML and live renderings into a unified schema.

Findings

01. VLMs struggle with pixel-accurate element extraction from slides.

02. VLMs maintain some robustness under controlled perturbations.

03. VLMs are less effective at understanding narrative structure across multiple slides.

Abstract

Vision-language models (VLMs) are increasingly used to evaluate multimodal content, including presentation slides, yet their slide-specific understanding remains underexplored despite their growing role as critics in agentic, model-forward pipelines. We introduce VLM-SlideEval, an evaluation framework that probes VLMs along three axes: (1) element-level extraction from slide images aligned to ground truth; (2) robustness to controlled perturbations in geometry, style, and text; and (3) higher-level comprehension, such as recovering a deck's narrative order from shuffled slides. Using publicly available decks from Zenodo (https://huggingface.co/datasets/Forceless/Zenodo10K/viewer/default/pptx), we standardize ground-truth element metadata from PowerPoint XML and live renderings into a unified, verifiable schema. Empirically, VLMs underperform on pixel-accurate extraction and show…
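
As a rough illustration of how axes (1) and (3) could be scored, the sketch below compares a predicted element bounding box with a ground-truth box and compares a predicted slide order with the true order. The abstract as excerpted here does not say which metrics the paper uses, so the choice of intersection-over-union and Kendall's tau, the box layout, and the function names `box_iou` and `kendall_tau` are all assumptions made for illustration.

```python
from itertools import combinations
from typing import Sequence, Tuple

# Assumed box layout: (left, top, right, bottom) in pixels.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes (0.0 if they are disjoint)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def kendall_tau(predicted: Sequence[int], truth: Sequence[int]) -> float:
    """Kendall rank correlation between two orderings of the same slide ids.

    Returns 1.0 for identical order and -1.0 for fully reversed order.
    Assumes each slide id appears exactly once in both sequences.
    """
    if sorted(predicted) != sorted(truth):
        raise ValueError("both orderings must contain the same slide ids")
    n = len(truth)
    if n < 2:
        return 1.0
    position = {slide: i for i, slide in enumerate(predicted)}
    # Each pair (x, y) has x before y in the true order; it is concordant
    # when the predicted order also places x before y.
    concordant = sum(
        1 for x, y in combinations(truth, 2) if position[x] < position[y]
    )
    total_pairs = n * (n - 1) // 2
    return (2 * concordant - total_pairs) / total_pairs


if __name__ == "__main__":
    # Hypothetical values, not taken from the paper.
    print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
    print(kendall_tau([1, 0, 2, 3], [0, 1, 2, 3]))
```

Under these assumptions, an extraction could be counted as correct when its IoU with the matched ground-truth element exceeds a chosen threshold, and a shuffled deck could be scored by the tau between the model's proposed order and the original one.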
