TikZilla: Scaling Text-to-TikZ with High-Quality Data and Reinforcement Learning
Christian Greisinger, Steffen Eger

TL;DR
TikZilla is a new approach that combines high-quality datasets and reinforcement learning to improve the generation of TikZ figures from text descriptions, outperforming larger models in accuracy and fidelity.
Contribution
The paper introduces the DaTikZ-V4 dataset and a two-stage training pipeline with reinforcement learning, improving Text-to-TikZ generation quality with smaller models.
Findings
TikZilla outperforms GPT-4o in human evaluations.
Reinforcement learning improves semantic accuracy.
Smaller models match GPT-5 in image-based evaluation.
Abstract
Large language models (LLMs) are increasingly used to assist scientists across diverse workflows. A key challenge is generating high-quality figures from textual descriptions, often represented as TikZ programs that can be rendered as scientific images. Prior research has proposed a variety of datasets and modeling approaches for this task. However, existing datasets for Text-to-TikZ are too small and noisy to capture the complexity of TikZ, causing mismatches between text and rendered figures. Moreover, prior approaches rely solely on supervised fine-tuning (SFT), which does not expose the model to the rendered semantics of the figure, often resulting in errors such as looping, irrelevant content, and incorrect spatial relations. To address these issues, we construct DaTikZ-V4, a dataset more than four times larger and substantially higher in quality than DaTikZ-V3, enriched with…
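As a rough illustration of what exposing a model to rendered output can mean, the sketch below scores a candidate by comparing its rendered image with a reference image and gives zero reward when the candidate fails to compile. This is a minimal sketch under stated assumptions, not the paper's reward: images are assumed to be already rendered and flattened into equal-length grayscale vectors, cosine similarity stands in for whatever image-similarity model is actually used, and the names `cosine_similarity` and `image_reward` are hypothetical.

```python
import math
from typing import Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def image_reward(rendered: Optional[Sequence[float]],
                 reference: Sequence[float]) -> float:
    """Hypothetical scalar reward for one generated TikZ program.

    `rendered` is the flattened image of the candidate, or None when the
    candidate did not compile (assumption: compile failures earn no reward).
    Cosine similarity is an illustrative stand-in for a learned image metric.
    """
    if rendered is None:
        return 0.0
    # Clamp at zero so the reward stays in [0, 1] even for negative inputs.
    return max(0.0, cosine_similarity(rendered, reference))
```

In a reinforcement learning loop, a scalar like this would be computed for each sampled program after rendering, which is the kind of signal supervised fine-tuning on token sequences alone does not provide.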
Peer Reviews
Decision·ICLR 2026 Poster
- The authors newly construct a large-scale TikZ dataset, approximately four times larger than the existing dataset.
- They define a reward function based on image similarity and demonstrate the effectiveness of reinforcement learning.
- They conduct not only automatic evaluations but also human evaluations, demonstrating the effectiveness of the proposed method.
- In Section 4, the paper describes the collection of a new large-scale dataset; however, there appears to be no mention of its license information. Since license details are essential for enabling data reuse, it would be helpful to provide not only the total number of data samples but also a breakdown of the dataset by license type.
- The correlation between the automatic evaluation metrics and the human evaluation results appears to be low, making it difficult to accurately assess the quality
1. This paper describes how to obtain a large-scale Text-to-TikZ dataset. They use a combination of choices (e.g., one TikZ per website; ensuring compilation; VLM-style descriptions) to obtain a large-scale, high-quality dataset.
2. They describe how to use such a dataset for SFT and for RL. For RL, they add a couple of domain-specific changes, e.g., the model for the reward and the scalar rewards for capturing semantic alignment.
3. Their strategy seems to work -- leading to clear improvements
1. It would be good to quantify the difference in model size in Figure 4: how much smaller is the model than GPT-5/4o in terms of sheer model size as well as compute at inference time?
2. It would be good to plot and understand performance as a function of the size of the data. If we want better performance, can we just collect more data, or are we already hitting diminishing returns here?
3. Is there a plan to make the dataset public? It would be useful for other researchers, I imagine.
**Clarity:** The paper is well-written. It effectively communicates its contributions relative to other work in this area so that even a reader that is not familiar with the text-to-TikZ problem can appreciate the results. DaTikZ-V4 is well-motivated by an analysis of existing datasets (though the reviewer would have appreciated a few more examples). The figures are all of high quality and efficiently visualize key ideas in the paper.

**Effective, small, specialized models:** One potential crit
**A lot of the paper feels routine:** The approach described in the paper appears to result in substantially better performance than prior small-model approaches. However, the methods used to achieve these improvements are exactly what most readers would likely expect: increase the size of the dataset, increase the quality of the dataset, implement recent approaches that have been successful in language modeling broadly (e.g., RL). One exception to this is the section on reward signals specific
Taxonomy
Topics: Multimodal Machine Learning Applications · Machine Learning in Materials Science · Generative Adversarial Networks and Image Synthesis
