SketchVLM: Vision language models can annotate images to explain thoughts and guide users

Brandon Collins; Logan Bolton; Hung Huy Nguyen; Mohammad Reza Taesiri; Trung Bui; Anh Totti Nguyen

arXiv:2604.22875 · cs.CV · April 29, 2026

TL;DR

SketchVLM is a training-free framework that lets vision-language models generate editable SVG overlays on input images, providing visual explanations that improve interpretability and accuracy across visual reasoning and drawing tasks.

Contribution

SketchVLM presents a model-agnostic method for visual explanations: SVG overlays that improve interpretability without additional training or fine-tuning.

Findings

01. Improves visual reasoning accuracy by up to 28.5 percentage points.

02. Improves annotation quality by up to 1.48x relative to image-editing and fine-tuned sketching baselines.

03. Single-turn generation achieves strong results, with multi-turn generation further improving collaboration.

Abstract

When answering questions about images, humans naturally point, label, and draw to explain their reasoning. In contrast, modern vision-language models (VLMs) such as Gemini-3-Pro and GPT-5 only respond with text, which can be difficult for users to verify. We present SketchVLM, a training-free, model-agnostic framework that enables VLMs to produce non-destructive, editable SVG overlays on the input image to visually explain their answers. Across seven benchmarks spanning visual reasoning (maze navigation, ball-drop trajectory prediction, and object counting) and drawing (part labeling, connecting-the-dots, and drawing shapes around objects), SketchVLM improves visual reasoning task accuracy by up to +28.5 percentage points and annotation quality by up to 1.48x relative to image-editing and fine-tuned sketching baselines, while also producing annotations that are more faithful to the…
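To make the idea of a non-destructive, editable SVG overlay concrete, here is a minimal sketch in Python. It is not the authors' implementation: the function name `build_overlay`, the annotation dictionary format, the file name `maze.png`, and the coordinates are all hypothetical, chosen only for illustration. The sketch references the input image from an SVG document and places each annotation in a separate group, so the image pixels are never modified and every mark can be edited or removed later.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def build_overlay(image_href, width, height, annotations):
    """Return an SVG document that layers annotations over an image.

    Hypothetical helper for illustration. The base image is only
    referenced by href, so it stays untouched; each annotation is its
    own SVG element inside a dedicated <g id="annotations"> group.
    """
    # Serialize SVG elements with a default namespace (no prefix).
    ET.register_namespace("", SVG_NS)
    svg = ET.Element(f"{{{SVG_NS}}}svg", {
        "width": str(width),
        "height": str(height),
        "viewBox": f"0 0 {width} {height}",
    })
    # Bottom layer: the original image, referenced rather than redrawn.
    ET.SubElement(svg, f"{{{SVG_NS}}}image", {
        "href": image_href,
        "x": "0",
        "y": "0",
        "width": str(width),
        "height": str(height),
    })
    # Top layer: one editable element per annotation.
    layer = ET.SubElement(svg, f"{{{SVG_NS}}}g", {"id": "annotations"})
    for ann in annotations:
        attrs = {k: str(v) for k, v in ann.items() if k not in ("tag", "text")}
        element = ET.SubElement(layer, f"{{{SVG_NS}}}{ann['tag']}", attrs)
        if "text" in ann:
            element.text = ann["text"]
    return ET.tostring(svg, encoding="unicode")


# Hypothetical annotations: circle a point and label it.
example = build_overlay("maze.png", 640, 480, [
    {"tag": "circle", "cx": 100, "cy": 120, "r": 15,
     "stroke": "red", "fill": "none"},
    {"tag": "text", "x": 120, "y": 115, "fill": "red", "text": "start"},
])
print(example)
```

Because the annotations live in their own group, deleting or editing that group leaves the referenced image unchanged, which is the property the abstract describes as non-destructive.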
