Semantic Document Derendering: SVG Reconstruction via Vision-Language Modeling

Adam Hazimeh; Ke Wang; Mark Collier; Gilles Baechler; Efi Kokiopoulou; Pascal Frossard

arXiv:2511.13478 · cs.CV · November 18, 2025

Open Access

TL;DR

This paper presents SliDer, a framework that uses Vision-Language Models to convert raster slide images into compact, editable SVG. The output preserves the slide's semantic structure, so image and text elements remain distinct and can be edited individually.

Contribution

Introduces SliDer, a new method leveraging Vision-Language Models for semantic derendering of slide images into structured SVGs, along with the Slide2SVG dataset for future research.

Findings

01  SliDer achieves a reconstruction LPIPS of 0.069 (LPIPS is a perceptual distance, so lower is better).

02  Human evaluators prefer SliDer in 82.9% of cases.

03  SliDer outperforms a zero-shot VLM baseline in semantic reconstruction.

Abstract

Multimedia documents such as slide presentations and posters are designed to be interactive and easy to modify. Yet, they are often distributed in a static raster format, which limits editing and customization. Restoring their editability requires converting these raster images back into structured vector formats. However, existing geometric raster-vectorization methods, which rely on low-level primitives like curves and polygons, fall short at this task. Specifically, when applied to complex documents like slides, they fail to preserve the high-level structure, resulting in a flat collection of shapes where the semantic distinction between image and text elements is lost. To overcome this limitation, we address the problem of semantic document derendering by introducing SliDer, a novel framework that uses Vision-Language Models (VLMs) to derender slide images as compact and editable…
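To make the distinction between a flat collection of shapes and a semantically structured document concrete, here is a minimal sketch of a slide assembled as an SVG in which each image and each text block is its own element. The function name `build_slide_svg` and the element dictionary schema are hypothetical and chosen for illustration only; this is not the representation or code used by SliDer or the Slide2SVG dataset.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def build_slide_svg(width, height, elements):
    """Assemble a slide as an SVG string with one node per semantic element.

    `elements` is a hypothetical list of dicts, each with a "type" of
    "image" or "text" plus layout attributes. The schema is an assumption
    made for this sketch, not a format defined by the paper.
    """
    # Emit the SVG namespace as the default namespace (no prefix).
    ET.register_namespace("", SVG_NS)
    root = ET.Element(
        f"{{{SVG_NS}}}svg",
        {
            "width": str(width),
            "height": str(height),
            "viewBox": f"0 0 {width} {height}",
        },
    )
    for el in elements:
        if el["type"] == "image":
            # An image stays a single <image> node that can be moved or swapped.
            ET.SubElement(
                root,
                f"{{{SVG_NS}}}image",
                {
                    "x": str(el["x"]),
                    "y": str(el["y"]),
                    "width": str(el["width"]),
                    "height": str(el["height"]),
                    "href": el["href"],
                },
            )
        elif el["type"] == "text":
            # Text stays real text (not outlined curves), so it remains editable.
            node = ET.SubElement(
                root,
                f"{{{SVG_NS}}}text",
                {
                    "x": str(el["x"]),
                    "y": str(el["y"]),
                    "font-size": str(el["font_size"]),
                },
            )
            node.text = el["content"]
        else:
            raise ValueError(f"unsupported element type: {el['type']!r}")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(
        build_slide_svg(
            1280,
            720,
            [
                {"type": "image", "x": 640, "y": 120, "width": 560,
                 "height": 420, "href": "chart.png"},
                {"type": "text", "x": 80, "y": 100, "font_size": 48,
                 "content": "Quarterly results"},
            ],
        )
    )
```

A geometric vectorizer would instead emit many `<path>` elements for the same slide, including the glyph outlines, so changing the title would mean redrawing curves rather than editing one string.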

Taxonomy

Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Advanced Image and Video Retrieval Techniques