Chart Deep Research in LVLMs via Parallel Relative Policy Optimization
Jiajin Tang, Gaoyang, Wenjie Wang, Sibei Yang, Xing Chen

TL;DR
This paper introduces PRPO (Parallel Relative Policy Optimization), a training method, and MCDR-Bench, an evaluation benchmark, to improve the deep research capabilities of LVLMs on charts. PRPO targets the training bottleneck and MCDR-Bench targets the evaluation bottleneck.
Contribution
The paper contributes Parallel Relative Policy Optimization (PRPO), a training method for handling multi-dimensional reward signals, and MCDR-Bench, a benchmark for objectively evaluating chart deep research in LVLMs.
Findings
PRPO effectively disentangles multi-dimensional reward signals.
MCDR-Bench enables objective assessment of deep research capabilities.
Experimental results show improved model performance in deep chart understanding.
Abstract
With the rapid advancement of data science, charts have evolved from simple numerical presentation tools to essential instruments for insight discovery and decision-making support. However, current chart data intelligence exhibits significant limitations in deep research capabilities, with existing methods predominantly addressing shallow tasks such as visual recognition or factual question-answering, rather than the complex reasoning and high-level data analysis that deep research requires. This limitation stems from two primary technical bottlenecks: at the training level, existing post-training techniques exhibit deficiencies in handling multi-dimensional reward signal interference and heterogeneous data gradient conflicts, preventing models from achieving balanced development across multiple capability dimensions; at the evaluation level, current methods remain limited to factual…
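The reward interference mentioned in the abstract can be illustrated with a small sketch. It is not the paper's PRPO formulation, which this excerpt does not give. It only contrasts two ways of turning an accuracy reward and a format reward (the two rewards named in the reviews) into group-relative advantages: summing the rewards and standardizing once within the group, as GRPO-style methods commonly do, versus standardizing each reward dimension separately before combining. All function names, weights, and reward values here are hypothetical.

```python
from statistics import mean, pstdev


def group_relative(values, eps=1e-8):
    """Standardize values within one sampled group (GRPO-style advantage)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def summed_advantages(rewards):
    """Sum all reward dimensions per response, then standardize once."""
    totals = [sum(r.values()) for r in rewards]
    return group_relative(totals)


def per_dimension_advantages(rewards, weights):
    """Standardize each reward dimension on its own, then combine with weights.

    Assumption for illustration only; not taken from the paper.
    """
    combined = [0.0] * len(rewards)
    for name, w in weights.items():
        adv = group_relative([r[name] for r in rewards])
        combined = [c + w * a for c, a in zip(combined, adv)]
    return combined


# Hypothetical group of four responses to one chart question.
# Accuracy varies on a small scale, format on a large one.
group = [
    {"accuracy": 0.6, "format": 0.0},  # more accurate, bad format
    {"accuracy": 0.5, "format": 1.0},  # less accurate, good format
    {"accuracy": 0.6, "format": 1.0},
    {"accuracy": 0.5, "format": 0.0},
]

summed = summed_advantages(group)
separate = per_dimension_advantages(group, {"accuracy": 1.0, "format": 1.0})

print("summed:  ", [round(a, 3) for a in summed])
print("separate:", [round(a, 3) for a in separate])
```

In the summed variant, the format reward dominates because its spread within the group is larger, so the less accurate but well-formatted response receives a higher advantage than the more accurate one. Standardizing each dimension first puts both signals on the same scale, and the two responses end up roughly tied under equal weights.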
Peer Reviews
Decision: ICLR 2026 Poster
- MCDR-Bench is an impactful dataset that can fill an important research gap in academia: a lack of a proper benchmark to evaluate deep research capabilities. Not only does it allow researchers to systematically evaluate the chart understanding ability of multimodal LLMs, but it also contributes to assessing LLMs' deep research report generation capabilities.
- The proposed RL training framework for chart deep research is technically sound. The authors pinpoint two key limitations of GRPO that
- If the authors could verify that PRPO can be used in a model-agnostic manner by training models other than Qwen2.5 with PRPO, it would make the paper even stronger.
- While PRPO greatly advances the optimization algorithm of GRPO, it still relies on the combination of accuracy + format reward. Do the authors anticipate that PRPO could benefit further from more refined reward shaping? For instance, what would happen if we add more vision-centric rewards, such as those proposed in [1,2] to PRPO
- Well-written paper.
- Comprehensive evaluations
- Related works can be improved
1. The problem and benchmark proposed in the paper are of great significance for the research on deep chart question answering and analytical reasoning.
2. The paper conducts sufficient experiments to verify the effectiveness of the proposed methods.
3. The paper elaborates on the proposed methods in detail.
4. The paper provides code for verifying the experimental results, and this practice is commendable.
1. The organizational structure of the paper is suboptimal.
2. Hyperparameter sensitivity: The paper mentions hyperparameters λ_k and λ_m but fails to discuss the sensitivity of the results to their settings, lacking a robustness analysis.
3. There is no detailed comparison of the time and resource consumption of different methods.
4. The comparison with baseline methods of the same type is mainly limited to GRPO. However, there are more recent state-of-the-art (SOTA) methods available, such
Taxonomy
Topics: Topic Modeling · Machine Learning in Healthcare · Multimodal Machine Learning Applications
