Semantic Document Derendering: SVG Reconstruction via Vision-Language Modeling
Adam Hazimeh, Ke Wang, Mark Collier, Gilles Baechler, Efi Kokiopoulou, Pascal Frossard

TL;DR
This paper presents SliDer, a novel framework that uses Vision-Language Models to convert raster slide images into compact, editable SVG files, preserving the semantic distinction between image and text elements so that the slides can be edited again.
Contribution
Introduces SliDer, a new method leveraging Vision-Language Models for semantic derendering of slide images into structured SVGs, along with the Slide2SVG dataset for future research.
Findings
SliDer achieves a reconstruction LPIPS of 0.069.
Human evaluators prefer SliDer in 82.9% of cases.
Outperforms the zero-shot VLM baseline in semantic reconstruction.
Abstract
Multimedia documents such as slide presentations and posters are designed to be interactive and easy to modify. Yet, they are often distributed in a static raster format, which limits editing and customization. Restoring their editability requires converting these raster images back into structured vector formats. However, existing geometric raster-vectorization methods, which rely on low-level primitives like curves and polygons, fall short at this task. Specifically, when applied to complex documents like slides, they fail to preserve the high-level structure, resulting in a flat collection of shapes where the semantic distinction between image and text elements is lost. To overcome this limitation, we address the problem of semantic document derendering by introducing SliDer, a novel framework that uses Vision-Language Models (VLMs) to derender slide images as compact and editable…
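To make the contrast with a flat collection of shapes concrete, here is a minimal sketch of what a semantically structured SVG could look like: image and text elements stay separate, editable nodes rather than being traced into curves and polygons. The element dictionary layout, the `build_slide_svg` helper, and the `chart.png` file name are all hypothetical and chosen for illustration; this is not SliDer's actual output format or pipeline.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def build_slide_svg(width, height, elements):
    """Serialize a list of semantic slide elements into an SVG string.

    `elements` is a hypothetical intermediate representation: each item is a
    dict with a "type" of either "image" or "text" plus its attributes.
    """
    svg = ET.Element(
        "svg",
        {
            "xmlns": SVG_NS,
            "width": str(width),
            "height": str(height),
            "viewBox": f"0 0 {width} {height}",
        },
    )
    for el in elements:
        if el["type"] == "image":
            # An image stays a single <image> node that can be moved or swapped.
            ET.SubElement(
                svg,
                "image",
                {
                    "x": str(el["x"]),
                    "y": str(el["y"]),
                    "width": str(el["width"]),
                    "height": str(el["height"]),
                    "href": el["href"],
                },
            )
        elif el["type"] == "text":
            # Text stays real text, so it remains selectable and editable.
            node = ET.SubElement(
                svg,
                "text",
                {
                    "x": str(el["x"]),
                    "y": str(el["y"]),
                    "font-size": str(el["font_size"]),
                },
            )
            node.text = el["content"]
        else:
            raise ValueError(f"unknown element type: {el['type']}")
    return ET.tostring(svg, encoding="unicode")


# Hypothetical slide: one picture and one title.
slide_elements = [
    {"type": "image", "x": 40, "y": 120, "width": 320, "height": 180, "href": "chart.png"},
    {"type": "text", "x": 40, "y": 60, "font_size": 32, "content": "Quarterly results"},
]
svg_text = build_slide_svg(640, 360, slide_elements)
print(svg_text)
```

Because the title is stored as a `<text>` node, changing its wording is a one-line edit to the SVG, which is the kind of editability that a purely geometric vectorization loses.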
Taxonomy
Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Advanced Image and Video Retrieval Techniques
