TL;DR
This paper investigates why current vision-language models (VLMs) struggle to understand data visualizations. It finds that errors originate mainly in the transfer of information from the vision module to the language module, and it highlights architectural limitations that affect performance.
Contribution
The study introduces FUGU, a suite of tasks for diagnosing visualization-understanding failures in VLMs, and uses activation patching and linear probes to locate the sources of error, revealing key architectural constraints.
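As a rough illustration of what a linear probe does, the sketch below fits a linear readout from activations to a data point's x-coordinate and scores it on held-out examples. Everything in it is an assumption made for illustration: the activations are synthetic Gaussian vectors, the coordinate is planted as a linear function of them, and the names `make_synthetic_data`, `fit_linear_probe`, and `r_squared` are hypothetical. It is not the paper's probing setup.

```python
import random

DIM = 4  # assumed activation width, chosen only to keep the toy small


def make_synthetic_data(n=200, seed=0):
    """Hypothetical stand-in for (activation, x-coordinate) pairs; not real VLM data."""
    rng = random.Random(seed)
    true_w = [1.0, -0.5, 0.3, 0.0]  # assumed linear encoding of the coordinate
    acts, xs = [], []
    for _ in range(n):
        a = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
        x = sum(w * ai for w, ai in zip(true_w, a)) + 0.5 + rng.gauss(0.0, 0.05)
        acts.append(a)
        xs.append(x)
    return acts, xs


def fit_linear_probe(acts, targets, lr=0.05, epochs=300):
    """Fit weights w and bias b by gradient descent on mean squared error."""
    w, b, n = [0.0] * DIM, 0.0, len(acts)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * DIM, 0.0
        for a, t in zip(acts, targets):
            err = sum(wi * ai for wi, ai in zip(w, a)) + b - t
            for i in range(DIM):
                grad_w[i] += 2.0 * err * a[i] / n
            grad_b += 2.0 * err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def r_squared(acts, targets, w, b):
    """Fraction of target variance explained by the probe's predictions."""
    preds = [sum(wi * ai for wi, ai in zip(w, a)) + b for a in acts]
    mean_t = sum(targets) / len(targets)
    ss_res = sum((p - t) ** 2 for p, t in zip(preds, targets))
    ss_tot = sum((t - mean_t) ** 2 for t in targets)
    return 1.0 - ss_res / ss_tot


acts, xs = make_synthetic_data()
w, b = fit_linear_probe(acts[:150], xs[:150])  # train on the first 150 examples
print(f"held-out R^2: {r_squared(acts[150:], xs[150:], w, b):.3f}")
```

A high held-out score says only that the coordinate is linearly decodable from the activations at that point; it does not show that later layers actually use it, which is why a causal method such as activation patching is a natural complement.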
Findings
Models often mispredict data point coordinates, leading to errors.
Providing correct coordinates improves performance, indicating errors occur during vision-language transfer.
Architectural constraints limit VLMs' ability to reliably understand complex visual data.
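The logic of activation patching can be sketched on a toy pipeline: run a clean input and a corrupted input, copy one intermediate activation from the clean run into the corrupted run, and measure how much of the clean output is restored. The two-unit "connector", its weights, and the names `run_model` and `restoration` below are all hypothetical; this is a minimal sketch of the general technique, not the models or procedure used in the paper.

```python
def run_model(x, patch=None):
    """Toy encoder -> connector -> readout pipeline.

    `patch` maps a connector unit index to a value that overwrites it.
    """
    enc = [2.0 * x[0], 2.0 * x[1]]      # toy "vision encoder"
    conn = [enc[0] + enc[1], enc[0]]    # toy "connector"; unit 1 ignores x[1]
    for unit, value in (patch or {}).items():
        conn[unit] = value              # overwrite the chosen activation
    out = 0.5 * conn[0] + 0.25 * conn[1]  # toy "language model" readout
    return conn, out


def restoration(clean_x, corrupt_x, unit):
    """Share of the clean-vs-corrupted output gap recovered by patching one unit."""
    clean_conn, clean_out = run_model(clean_x)
    _, corrupt_out = run_model(corrupt_x)
    _, patched_out = run_model(corrupt_x, patch={unit: clean_conn[unit]})
    return (patched_out - corrupt_out) / (clean_out - corrupt_out)


clean, corrupt = (1.0, 3.0), (1.0, 0.0)  # the inputs differ only in x[1]
for unit in (0, 1):
    print(f"connector unit {unit}: restoration = {restoration(clean, corrupt, unit)}")
```

In this toy, only connector unit 0 depends on the input that was changed, so patching it recovers the whole gap while patching unit 1 recovers none. Applied to a real model, the same comparison across layers and positions indicates where the information needed for the correct answer is carried or lost.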
Abstract
Data visualizations are vital components of many scientific articles and news stories. Current vision-language models (VLMs) still struggle on basic data visualization understanding tasks, but the causes of failure remain unclear. Are VLM failures attributable to limitations in how visual information in the data visualization is encoded, how information is transferred between the vision and language modules, or how information is processed within the language module? We developed FUGU, a suite of data visualization understanding tasks, to precisely characterize potential sources of difficulty (e.g., extracting the position of data points, distances between them, and other summary statistics). We used FUGU to investigate three widely used VLMs. To diagnose the sources of errors produced by these models, we used activation patching and linear probes to trace information flow through…
Peer Reviews
Decision · Submitted to ICLR 2026
- **Clear motivation and positioning:** The paper addresses a relevant and timely question regarding the ability of VLMs to understand charts and data visualizations. The research problem is articulated clearly, focusing on identifying where failures in chart understanding may originate.
- **Well-scoped contributions:** The proposed FUGU tasks are thoughtfully designed and cover a useful range of chart-understanding skills (e.g., counting, coordinates, extrema). The use of both causal i
- **Limited coverage of visualization types:** The analysis focuses mainly on Cartesian point-based charts (e.g., line and bar charts), where positional relationships naturally reflect values. However, for non-Cartesian visualizations such as pie charts or radar charts, angular information is equally critical. The current framework does not appear to account for these cases, limiting its generality across broader visualization types.
- **Insufficient consideration of visual encoder scale an
1. Introduces FUGU, a diagnostic benchmark designed to "unit test" the fine-grained capabilities of VLMs on data visualizations.
2. Provides a clear and localized diagnosis for VLM failures, identifying the vision-language connector and early LM layers as the primary bottleneck.
3. The work appears reproducible due to clear descriptions of the FUGU benchmark tasks and the diagnostic methods.
* The FUGU benchmark's scope is currently narrow, focusing on synthetic scatter plots (Sec 3.1). This makes it unclear if the findings and the identified bottleneck generalize to other common chart families (e.g., bar charts, line graphs, histograms) or to more complex, real-world visualizations with varied aesthetics, occlusions, or multiple panels.
* The paper's contribution is primarily diagnostic. While it successfully identifies the location of the information bottleneck (Sec 5.3), it does
+ The dataset design is simple, clear, and well-controlled, allowing for a clean analysis of specific model behaviors and error sources.
+ The combination of multiple analysis methods provides different perspectives for assessing model behavior.
+ The conclusion drawn from the linear probe experiments is interesting and convincing.
- The dataset scope is narrow. FUGU focuses on scatterplots with limited data points, no occlusion, and fixed glyphs. Real charts usually include bars/lines, partial occlusion, diverse scaled axes, and additional legends/annotations. It remains unclear whether the experimental results and conclusions would still hold for other chart types or in-the-wild data.
