LLM Code Customization with Visual Results: A Benchmark on TikZ

Charly Reux (DiverSe); Mathieu Acher (DiverSe); Djamel Eddine Khelladi (DiverSe); Olivier Barais (DiverSe); Clément Quinton (SPIRALS)

arXiv:2505.04670 · cs.SE · June 5, 2025

TL;DR

This paper introduces vTikZ, a benchmark for evaluating how well Large Language Models can customize TikZ code to produce a specific visual result. An evaluation of state-of-the-art LLMs on the benchmark shows that they still struggle with this task.

Contribution

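The paper presents vTikZ, the first benchmark for assessing LLMs' ability to modify code so that it produces an intended visual outcome. The benchmark comprises curated editing scenarios, parameterized ground truths, and a review tool that uses visual feedback to assess correctness.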

Findings

1. State-of-the-art LLMs struggle with visually aligned code modifications.

2. Current AI code editing methods have significant reliability gaps.

3. vTikZ enables new research in visual feedback-driven code customization.

Abstract

With the rise of AI-based code generation, customizing existing code out of natural language instructions to modify visual results (such as figures or images) has become possible, promising to reduce the need for deep programming expertise. However, even experienced developers can struggle with this task, as it requires identifying relevant code regions (feature location), generating valid code variants, and ensuring the modifications reliably align with user intent. In this paper, we introduce vTikZ, the first benchmark designed to evaluate the ability of Large Language Models (LLMs) to customize code while preserving coherent visual outcomes. Our benchmark consists of carefully curated vTikZ editing scenarios, parameterized ground truths, and a reviewing tool that leverages visual feedback to assess correctness. Empirical evaluation with state-of-the-art LLMs shows that existing…

Code & Models

Datasets

CharlyR/vtikz
dataset · 15 downloads

Taxonomy

Methods: ALIGN