Charts Are Not Images: On the Challenges of Scientific Chart Editing
Shawn Li, Ryan Rossi, Sungchul Kim, Sunav Choudhary, Franck Dernoncourt, Puneet Mathur, Zhengzhong Tu, Yue Zhao

TL;DR
This paper introduces FigEdit, a large-scale benchmark for scientific chart editing that emphasizes the importance of structured data understanding over pixel manipulation, highlighting current model limitations and guiding future research.
Contribution
The paper presents FigEdit, a comprehensive benchmark with diverse chart types and tasks, to evaluate and advance structure-aware scientific figure editing models.
Findings
State-of-the-art models perform poorly on structured chart edits.
Traditional metrics like SSIM and PSNR are insufficient for semantic correctness.
Current models mainly excel at pixel-level manipulations, not structured transformations.
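The gap between pixel similarity and semantic correctness can be illustrated with a toy example. The sketch below is not taken from the paper: the 8x8 grayscale "chart", the helper names (`make_chart`, `psnr`, `bar_height`), and the one-pixel-shift scenario are all assumptions chosen for illustration. It shows PSNR preferring an output that ignores the edit over one that encodes the correct value but draws the bar one pixel to the right.

```python
import math

def make_chart(bar_height, bar_col=3, size=8):
    """Render a toy grayscale chart: white background (255), one black bar (0)."""
    img = [[255] * size for _ in range(size)]
    for row in range(size - bar_height, size):
        img[row][bar_col] = 0
    return img

def psnr(a, b, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two equal-sized grayscale images."""
    n = len(a) * len(a[0])
    mse = sum((pa - pb) ** 2 for ra, rb in zip(a, b) for pa, pb in zip(ra, rb)) / n
    return math.inf if mse == 0 else 10 * math.log10(max_val ** 2 / mse)

def bar_height(img):
    """Recover the encoded data value: the number of black pixels in the image."""
    return sum(px == 0 for row in img for px in row)

# Hypothetical instruction: "double the bar's value" (height 2 -> 4).
reference = make_chart(bar_height=4)            # the intended result
unedited = make_chart(bar_height=2)             # ignores the instruction
shifted = make_chart(bar_height=4, bar_col=4)   # correct value, bar drawn one pixel right

psnr_unedited = psnr(reference, unedited)
psnr_shifted = psnr(reference, shifted)

print(f"unedited: PSNR={psnr_unedited:.2f} dB, value={bar_height(unedited)}")
print(f"shifted:  PSNR={psnr_shifted:.2f} dB, value={bar_height(shifted)}")
```

Under these assumptions the unedited image scores higher on PSNR even though only the shifted image carries the requested data value, which is the kind of mismatch a semantics-aware evaluation is meant to catch.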
Abstract
Generative models, such as diffusion and autoregressive approaches, have demonstrated impressive capabilities in editing natural images. However, applying these tools to scientific charts rests on a flawed assumption: a chart is not merely an arrangement of pixels but a visual representation of structured data governed by a graphical grammar. Consequently, chart editing is not a pixel-manipulation task but a structured transformation problem. To address this fundamental mismatch, we introduce FigEdit, a large-scale benchmark for scientific figure editing comprising over 30,000 samples. Grounded in real-world data, our benchmark is distinguished by its diversity, covering 10 distinct chart types and a rich vocabulary of complex editing instructions. The benchmark is organized into five distinct and progressively challenging tasks: single edits, multi edits, conversational edits,…
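To make the "structured transformation" framing concrete, here is a minimal sketch that assumes a chart is stored as a Vega-Lite-style JSON specification (the reviews below note that the benchmark's figures are rendered from Vega/Vega-Lite specs). The function names echo operations named in the reviews (`add_datapoint`, `change_background_color`), but the implementations, field names, and sample data are illustrative assumptions, not the benchmark's code.

```python
import copy

# A hypothetical Vega-Lite-style bar chart spec; the data values are made up.
spec = {
    "mark": "bar",
    "background": "white",
    "data": {"values": [
        {"category": "A", "value": 3},
        {"category": "B", "value": 5},
        {"category": "C", "value": 2},
    ]},
    "encoding": {
        "x": {"field": "category", "type": "nominal"},
        "y": {"field": "value", "type": "quantitative"},
    },
}

def add_datapoint(spec, row):
    """Return a copy of the spec with one extra data row; the input is not mutated."""
    edited = copy.deepcopy(spec)
    edited["data"]["values"].append(dict(row))
    return edited

def change_background_color(spec, color):
    """Return a copy of the spec with a new top-level background color."""
    edited = copy.deepcopy(spec)
    edited["background"] = color
    return edited

# Compose two atomic edits; everything not targeted stays identical by construction.
edited = change_background_color(
    add_datapoint(spec, {"category": "D", "value": 7}),
    "#f0f0f0",
)
```

Because each edit touches only the targeted field of the specification, the marks, encodings, and remaining data are preserved exactly, something a pixel-level editor has to approximate.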
Peer Reviews
Decision·ICLR 2026 Poster
- **Quality:** Builds a **large, well-controlled benchmark (30K+ charts)** generated via deterministic Vega rendering, with clear task taxonomy and reproducible evaluation.
- **Clarity:** The paper is well-organized and visually clear, using intuitive figures and radar plots to illustrate the **gap between pixel similarity and semantic correctness**.
- **Significance:** Establishes the first **semantics-aware benchmark** for chart editing, providing a valuable testbed for evaluating multimodal an
- **Synthetic generation bias vs. real-chart curation.** FigEdit’s base figures and edits are produced via **LLM-guided Vega/Vega-Lite specs** and rendered images, which can drift from real publication practices; several prior sets rely on **human-curated real charts** and manual validation (e.g., ChartEdit’s 1,405 instructions on 233 real charts).
- **Subjectivity/noise in LLM-based scoring.** The paper’s “semantics-aware” evaluation relies on **LLM judgement** for instruction following/content
1. **Excellent Problem Formulation:** The paper's primary strength is its clear and insightful formulation of scientific chart editing as a "structured transformation" problem governed by a graphical grammar. This conceptual shift from pixel-manipulation to structure-awareness is crucial and correctly identifies a fundamental mismatch in current approaches.
2. **High-Quality, Comprehensive Benchmark:** The introduction of FigEdit is a significant contribution. The benchmark is large-scale, di
1. **Limited Coverage of Edit Operations:** The set of atomic edits, while canonical, appears somewhat limited. The paper focuses on operations like `add_datapoint`, `change_background_color`, and `increase_text_size` (Table 3, Appendix C). However, real-world chart editing often involves more complex structural changes, such as changing chart type (e.g., bar to line), reordering categories, grouping/ungrouping data, or modifying axis scales (e.g., linear to log). The current operation set may
- The paper proposes a novel problem formulation (figure editing) that could have significant and practical real-world impact across a variety of industries, like research and business.
- The paper rethinks traditional pixel-based image metrics (LPIPS, PSNR, SSIM, etc.) for better evaluating the specific task of figure editing. This is an important research direction for other generative vision tasks, as well.
- The dataset is constructed from diverse real-world data across different fields, wh
- There is limited evaluation / justification for why the LLM-based metric is better or more reliable than traditional metrics. A comparison between all these scores and human evaluation would be beneficial for showing whether or not the LLM metric is actually better for evaluating figure edits.
- Similarly, it would be informative to compare the LLM score per-category (Instr., Preserv., Qual.) to average human score per-category.
- Fig. 1. The fonts are quite small, especially the axes of the c
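The comparison the reviewer asks for, between automatic scores and human judgement, is commonly done with a rank correlation. The following is a minimal sketch of Spearman's rho in plain Python; the scores are placeholder numbers invented purely to exercise the code and do not come from the paper or its reviews, and the variable names are hypothetical.

```python
import math

def average_ranks(values):
    """Assign 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Placeholder per-sample scores (NOT real results).
human_scores = [1, 2, 3, 4, 5]
metric_agrees = [0.1, 0.2, 0.3, 0.4, 0.5]     # ranks samples like the humans do
metric_disagrees = [0.5, 0.4, 0.3, 0.2, 0.1]  # ranks samples in the opposite order

rho_agrees = spearman(human_scores, metric_agrees)
rho_disagrees = spearman(human_scores, metric_disagrees)
```

Computing rho once per metric (and, following the second point above, once per category) against the same human ratings would show which automatic score tracks human judgement more closely.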
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Data Visualization and Analytics · Multimodal Machine Learning Applications
