See it. Say it. Sorted: Agentic System for Compositional Diagram Generation
Hantao Zhang, Jingyang Liu, Ed Li

TL;DR
This paper introduces a novel, training-free agentic system that combines vision-language and large language models to convert rough sketches into precise, editable SVG diagrams, improving layout fidelity and structural accuracy.
Contribution
The proposed system pairs a Critic VLM and a Judge VLM with multiple candidate LLMs in an iterative, qualitative reasoning loop for sketch-to-diagram conversion, preserving global constraints such as alignment and supporting human-in-the-loop corrections.
Findings
Outperforms GPT-5 and Gemini-2.5-Pro in reconstructing flowchart sketches.
Accurately composes complex primitives like multi-headed arrows.
Supports human-in-the-loop corrections and is extensible to presentation tools.
Abstract
We study sketch-to-diagram generation: converting rough hand sketches into precise, compositional diagrams. Diffusion models excel at photorealism but struggle with the spatial precision, alignment, and symbolic structure required for flowcharts. We introduce See it. Say it. Sorted., a training-free agentic system that couples a Vision-Language Model (VLM) with Large Language Models (LLMs) to produce editable Scalable Vector Graphics (SVG) programs. The system runs an iterative loop in which a Critic VLM proposes a small set of qualitative, relational edits; multiple candidate LLMs synthesize SVG updates with diverse strategies (conservative->aggressive, alternative, focused); and a Judge VLM selects the best candidate, ensuring stable improvement. This design prioritizes qualitative reasoning over brittle numerical estimates, preserves global constraints (e.g., alignment,…
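The Critic, candidate, Judge loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`refine`, `toy_critic`, `toy_synthesize`, `toy_judge`) and the `max_rounds` parameter are hypothetical, the model calls are replaced by simple deterministic stand-ins, and keeping the current SVG among the candidates is an assumed way of modelling "stable improvement".

```python
import re
from typing import Callable, List, Sequence

# Strategy labels taken from the abstract; how each one is prompted is not specified there.
STRATEGIES = ("conservative", "aggressive", "alternative", "focused")

Critic = Callable[[str], List[str]]                  # svg -> qualitative, relational edits
Synthesizer = Callable[[str, List[str], str], str]   # (svg, edits, strategy) -> updated svg
Judge = Callable[[Sequence[str]], int]               # candidates -> index of the best one


def refine(svg: str, critic: Critic, synthesize: Synthesizer,
           judge: Judge, max_rounds: int = 5) -> str:
    """Iteratively improve an SVG program (hypothetical sketch of the loop)."""
    for _ in range(max_rounds):
        edits = critic(svg)
        if not edits:          # the critic has nothing left to suggest
            break
        # Assumption: the current SVG stays in the pool, so a round can never
        # make the diagram worse in the judge's eyes.
        candidates = [svg] + [synthesize(svg, edits, s) for s in STRATEGIES]
        best = judge(candidates)
        if best == 0:          # no candidate beat the current diagram
            break
        svg = candidates[best]
    return svg


# --- Toy stand-ins for the VLM/LLM calls (assumptions, for illustration only) ---
X_ATTR = re.compile(r'\bx="(\d+)"')


def toy_critic(svg: str) -> List[str]:
    # Emits a qualitative instruction rather than a numeric offset.
    return ["left-align the boxes"] if len(set(X_ATTR.findall(svg))) > 1 else []


def toy_synthesize(svg: str, edits: List[str], strategy: str) -> str:
    # Only the "aggressive" strategy acts in this toy: it snaps every x to the first box.
    if strategy == "aggressive":
        first_x = X_ATTR.findall(svg)[0]
        return X_ATTR.sub(f'x="{first_x}"', svg)
    return svg


def toy_judge(candidates: Sequence[str]) -> int:
    # Prefers the candidate whose boxes have the smallest horizontal spread.
    def spread(svg: str) -> int:
        xs = [int(v) for v in X_ATTR.findall(svg)]
        return max(xs) - min(xs)
    return min(range(len(candidates)), key=lambda i: spread(candidates[i]))


sketch = ('<svg xmlns="http://www.w3.org/2000/svg">'
          '<rect x="10" y="10" width="80" height="30"/>'
          '<rect x="37" y="60" width="80" height="30"/></svg>')
result = refine(sketch, toy_critic, toy_synthesize, toy_judge)
```

In this toy run the two rectangles end up left-aligned after one round, and the second round stops because the critic has no further edits. In a real system each stand-in would be a model call, and the judge would compare rendered candidates against the input sketch.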
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Data Visualization and Analytics
