Chart-CoCa: Self-Improving Chart Understanding of Vision LMs via Code-Driven Synthesis and Candidate-Conditioned Answering
Gongyao Jiang, Qiong Luo

TL;DR
This paper introduces Chart-CoCa, a self-improving framework for chart understanding in vision language models that uses code-driven synthetic data generation and candidate-conditioned answering to enhance accuracy without human labels.
Contribution
It presents a novel pipeline combining code-based chart synthesis and candidate-conditioned answering, enabling self-improvement without human-labeled data.
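To make the idea of code-driven synthesis concrete, here is a minimal sketch, not the authors' pipeline. It assumes a toy bar-chart setting in which the plotting script and the answer are both derived from the same data, so the label cannot drift from the chart. The names `make_triplet` and `CATEGORIES`, the question template, and the output file name are hypothetical. The plotting script is only built as text here; running it would require matplotlib.

```python
import random

# Hypothetical category labels for a toy bar chart.
CATEGORIES = ["A", "B", "C", "D", "E"]


def make_triplet(seed: int) -> dict:
    """Build one (chart code, question, answer) triplet from shared data."""
    rng = random.Random(seed)
    # Distinct integer values, so the maximum is unique and the answer unambiguous.
    values = rng.sample(range(10, 100), k=len(CATEGORIES))
    data = dict(zip(CATEGORIES, values))

    # Plotting script as text; executing it (with matplotlib installed)
    # would render the chart image.
    chart_code = (
        "import matplotlib.pyplot as plt\n"
        f"data = {data!r}\n"
        "plt.bar(list(data.keys()), list(data.values()))\n"
        "plt.savefig('chart.png')\n"
    )

    question = "Which category has the highest value?"
    # The answer is computed from the same data the script plots.
    answer = max(data, key=data.get)
    return {
        "data": data,
        "chart_code": chart_code,
        "question": question,
        "answer": answer,
    }
```

Because the answer is computed rather than predicted, each triplet in this sketch is aligned by construction, which is the property the synthesis pipeline relies on to avoid noisy labels.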
Findings
Achieves up to a 15.50-point accuracy gain over the initial VLM.
Demonstrates the effectiveness of synthetic data generation for chart understanding.
Shows that the self-improving paradigm enhances VLM performance.
Abstract
Vision Language Models (VLMs) often struggle with chart understanding tasks, particularly in accurate chart description and complex reasoning. Synthetic data generation is a promising solution, but it usually faces the challenge of noisy labels. To address this challenge, we first introduce a chart synthesis pipeline that generates aligned chart-question-answer triplets through code generation and execution, ensuring the reliability of synthetic data without human intervention. Furthermore, inspired by test-time scaling that increases inference budget and thereby improves performance, we design a candidate-conditioned answering process. The VLM first generates multiple responses per query, and then synthesizes the final answer by contextualizing these candidates. Experiments demonstrate significant improvements, with up to 15.50 points accuracy gain over the initial VLM, in a fully…
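The two-stage answering process described in the abstract can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: `generate` is a hypothetical stand-in for a call to the VLM (the chart image input is omitted for brevity), and the prompt wording and the default of four candidates are made up for this example.

```python
from typing import Callable, List


def build_prompt(question: str, candidates: List[str]) -> str:
    """Place the sampled candidates in the context of the final query."""
    listing = "\n".join(
        f"Candidate {i + 1}: {c}" for i, c in enumerate(candidates)
    )
    return (
        f"Question: {question}\n"
        f"{listing}\n"
        "Considering the candidates above, give the final answer."
    )


def candidate_conditioned_answer(
    question: str,
    generate: Callable[[str], str],  # hypothetical VLM call: prompt -> text
    n_candidates: int = 4,           # assumed value, not taken from the paper
) -> str:
    # Stage 1: sample several responses to the same query.
    candidates = [generate(question) for _ in range(n_candidates)]
    # Stage 2: ask the model again, conditioned on its own candidates.
    return generate(build_prompt(question, candidates))
```

With a real model, `generate` would need sampling enabled (for example a nonzero temperature) so that the candidates differ; with deterministic decoding all candidates would be identical and the second stage would add little.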
Taxonomy
Topics: AI-based Problem Solving and Planning · Semantic Web and Ontologies
