Visual Self-Refine: A Pixel-Guided Paradigm for Accurate Chart Parsing
Jinsong Li, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jiaqi Wang, Dahua Lin

TL;DR
This paper introduces Visual Self-Refine, a paradigm enabling models to iteratively visualize and self-correct their visual perception errors, significantly improving accuracy in complex chart parsing tasks.
Contribution
We propose the Visual Self-Refine paradigm and instantiate it with ChartVSR, a model that iteratively refines pixel-level localizations for accurate chart parsing, along with a new challenging benchmark.
Findings
ChartVSR achieves higher accuracy in chart parsing.
VSR improves visual perception accuracy through iterative self-correction.
The new benchmark challenges existing models and promotes progress.
Abstract
While Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities for reasoning and self-correction at the textual level, these strengths provide minimal benefits for complex tasks centered on visual perception, such as Chart Parsing. Existing models often struggle with visually dense charts, leading to errors like data omission, misalignment, and hallucination. Inspired by the human strategy of using a finger as a "visual anchor" to ensure accuracy when reading complex charts, we propose a new paradigm named Visual Self-Refine (VSR). The core idea of VSR is to enable a model to generate pixel-level localization outputs, visualize them, and then feed these visualizations back to itself, allowing it to intuitively inspect and correct its own potential visual perception errors. We instantiate the VSR paradigm in the domain of Chart Parsing by proposing ChartVSR. This…
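The generate, visualize, and feed-back loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not ChartVSR's implementation: the names `overlay_points`, `visual_self_refine`, `propose`, `refine`, and `toy_refine` are hypothetical, the "image" is a toy grid of integers, and the model is replaced by plain Python callables.

```python
from typing import Callable, List, Tuple

Point = Tuple[int, int]   # (x, y) pixel coordinate
Image = List[List[int]]   # toy grayscale image: rows of pixel values


def overlay_points(image: Image, points: List[Point], marker: int = 255) -> Image:
    """Return a copy of the image with each in-bounds point drawn as one marked pixel."""
    canvas = [row[:] for row in image]
    height = len(canvas)
    width = len(canvas[0]) if height else 0
    for x, y in points:
        if 0 <= x < width and 0 <= y < height:
            canvas[y][x] = marker
    return canvas


def visual_self_refine(
    image: Image,
    propose: Callable[[Image], List[Point]],
    refine: Callable[[Image, Image, List[Point]], List[Point]],
    max_rounds: int = 3,
) -> List[Point]:
    """Propose points, render them, and let `refine` correct them until stable.

    `propose` and `refine` are hypothetical stand-ins for model calls.
    """
    points = propose(image)
    for _ in range(max_rounds):
        rendered = overlay_points(image, points)      # visualize current guess
        new_points = refine(image, rendered, points)  # feed visualization back
        if new_points == points:                      # nothing left to correct
            break
        points = new_points
    return points


def toy_refine(image: Image, rendered: Image, points: List[Point]) -> List[Point]:
    """Toy stand-in for the model: nudge each point one pixel toward the nearest
    pixel whose value is 1. It ignores `rendered`; a real model would inspect it."""
    targets = [(x, y) for y, row in enumerate(image) for x, v in enumerate(row) if v == 1]

    def step(a: int, b: int) -> int:
        return a + (b > a) - (b < a)

    refined = []
    for px, py in points:
        tx, ty = min(targets, key=lambda t: abs(t[0] - px) + abs(t[1] - py))
        refined.append((step(px, tx), step(py, ty)))
    return refined
```

With a 5x5 grid whose only data pixel is at (3, 2) and an initial guess of (1, 2), the loop moves the point one pixel per round and stops once a round produces no change. In a real system, `refine` would be a call to the model with the rendered overlay as input.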
Peer Reviews
Decision·ICLR 2026 Poster
The new dataset seems nice and useful for chart parsing evals. VSR is an interesting recipe and focusing on pixel-level annotations is a nice instantiation of this setup for chart parsing. Some of the improvements seem compelling, particularly the information dense charts. If the extra calls are prohibitive for an inference pipeline, this recipe can probably be used to create distillation data.
While the recipe is interesting, it’s not very general and will probably become outdated for this task as models’ visual understanding improves over time. It seems strange that performance doesn’t improve much after a step or two of refinement even though there’s so much headroom. Why is this? Maybe annotations should be adjusted or focused on incorrect ones? Or doing step-by-step correction is necessary? Either way, it seems like the feedback and the recipe need some…refinement.
1. The paper presents a novel and interesting paradigm for chart parsing, introducing a visually grounded self-correction mechanism that enhances interpretability and addresses an existing gap in LVLM perception. 2. The authors introduce a high-quality dataset, ChartP-Bench, which is carefully curated, diverse in style, and fills an important gap in chart parsing evaluation. 3. The ablation studies are comprehensive, providing thorough analyses of the effects of pixel localization and refinement
1. The chart parsing paradigm proposed in this paper can be viewed as a type of reasoning paradigm. However, the experimental section lacks comparisons with other recent visual reasoning models, such as o1, Qwen3-VL, and InternVL-3.5. I understand that some of these models might not have been publicly available at the time of submission, but I recommend that the authors include such comparisons in future revisions to strengthen the solidity and comprehensiveness of the work. 2. Around line 405,
A visual self-refinement paradigm, Visual Self-Refine (VSR), is proposed: the model first generates localization points, visualizes them, and then feeds them back to the model for self-checking and error correction. In the chart parsing task, the process is divided into two stages: Refine and Decode. A challenging benchmark, ChartP-Bench, is constructed, and ChartQA-SE is cleaned to obtain ChartQA-SE-Clean. Strong performance is reported on multiple benchmarks, especially outperforming st
The evaluation covers only a limited set of benchmarks and lacks authoritative datasets like Chart-Pro and ChartXiv. The method has limited novelty, and its two-stage design is very similar to the design philosophy of SoM. Many previous works on visual prompts have demonstrated that such visual prompts can improve performance. The paper also lacks a crucial baseline, such as a comparison of the localization ability of the first-stage model with that of other grounding models on chart localization tasks.
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
