EncQA: Benchmarking Vision-Language Models on Visual Encodings for Charts
Kushin Mukherjee, Donghao Ren, Dominik Moritz, Yannick Assogba

TL;DR
EncQA is a benchmark for evaluating vision-language models across a range of visual encodings and chart-understanding tasks; it reveals gaps in current models' reasoning capabilities.
Contribution
We present EncQA, a benchmark of 2,076 synthetic question-answer pairs covering six visual encoding channels and eight analytic tasks, and use it to highlight the limitations of current models in visual reasoning for charts.
Findings
Model performance varies across encodings and tasks.
Scaling model size does not always improve performance.
Targeted strategies are needed to address reasoning gaps.
Abstract
Multimodal vision-language models (VLMs) continue to achieve ever-improving scores on chart understanding benchmarks. Yet, we find that this progress does not fully capture the breadth of visual reasoning capabilities essential for interpreting charts. We introduce EncQA, a novel benchmark informed by the visualization literature, designed to provide systematic coverage of visual encodings and analytic tasks that are crucial for chart understanding. EncQA provides 2,076 synthetic question-answer pairs, enabling balanced coverage of six visual encoding channels (position, length, area, color quantitative, color nominal, and shape) and eight tasks (find extrema, retrieve value, find anomaly, filter values, compute derived value exact, compute derived value relative, correlate values, and correlate values relative). Our evaluation of 9 state-of-the-art VLMs reveals that performance varies…
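To make the encoding-by-task breakdown concrete, here is a minimal sketch of how per-cell accuracy could be tallied. The encoding and task names come from the abstract; the record layout (`encoding`, `task`, `correct`), the function `accuracy_by_cell`, and the toy records are assumptions made for illustration. They are not the authors' data format, evaluation code, or results.

```python
from collections import defaultdict

# Encoding channels and tasks as listed in the abstract.
ENCODINGS = ["position", "length", "area",
             "color quantitative", "color nominal", "shape"]
TASKS = ["find extrema", "retrieve value", "find anomaly", "filter values",
         "compute derived value exact", "compute derived value relative",
         "correlate values", "correlate values relative"]


def accuracy_by_cell(records):
    """Return accuracy per (encoding, task) pair.

    `records` is a hypothetical iterable of dicts with the keys
    "encoding", "task", and "correct" (bool); this layout is assumed.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in records:
        key = (r["encoding"], r["task"])
        totals[key] += 1
        hits[key] += int(r["correct"])
    # Only cells that actually received at least one record are reported.
    return {key: hits[key] / totals[key] for key in totals}


# Toy records, invented purely to exercise the function.
records = [
    {"encoding": "position", "task": "find extrema", "correct": True},
    {"encoding": "position", "task": "find extrema", "correct": False},
    {"encoding": "shape", "task": "retrieve value", "correct": True},
]
cells = accuracy_by_cell(records)
```

Reporting one score per (encoding, task) cell, rather than a single aggregate, is what lets differences across encodings and tasks show up instead of being averaged away.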
