TL;DR
This paper shows that multimodal large language models (MLLMs) are vulnerable to misleading visualizations: their question-answering accuracy drops, on average, to the level of the random baseline. It then compares inference-time methods that improve robustness without compromising accuracy on non-misleading charts.
Contribution
The paper uncovers a vulnerability of MLLMs to misleading visualizations and provides the first comparison of six inference-time methods for mitigating it. Two of them, table-based QA and redrawing the visualization, prove effective.
Findings
MLLM QA accuracy on misleading charts drops, on average, to the random baseline
Two inference-time methods, table-based QA and redrawing the visualization, improve accuracy by up to 19.6 percentage points
Code and data are made publicly available for further research
Abstract
Visualizations play a pivotal role in daily communication in an increasingly data-driven world. Research on multimodal large language models (MLLMs) for automated chart understanding has accelerated massively, with steady improvements on standard benchmarks. However, for MLLMs to be reliable, they must be robust to misleading visualizations, i.e., charts that distort the underlying data, leading readers to draw inaccurate conclusions. Here, we uncover an important vulnerability: MLLM question-answering (QA) accuracy on misleading visualizations drops on average to the level of the random baseline. To address this, we provide the first comparison of six inference-time methods to improve QA performance on misleading visualizations, without compromising accuracy on non-misleading ones. We find that two methods, table-based QA and redrawing the visualization, are effective, with…
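The table-based QA idea named in the abstract can be sketched as a two-step pipeline: first extract the underlying data table from the chart, then answer the question from the table alone, so the distorted visual encoding is no longer in play. This is a minimal sketch and not the authors' implementation. The names `extract_table`, `answer_from_table`, `table_based_qa`, and `random_baseline` are hypothetical, the two model calls are left as caller-supplied functions, and the multiple-choice format is an assumption suggested by the abstract's reference to a random baseline.

```python
from typing import Callable, Sequence

# Hypothetical interfaces: wrap any MLLM or LLM client to match these shapes.
TableExtractor = Callable[[bytes], str]             # chart image -> data table as text
TextQA = Callable[[str, str, Sequence[str]], str]   # (table, question, options) -> option


def table_based_qa(
    chart_image: bytes,
    question: str,
    options: Sequence[str],
    extract_table: TableExtractor,
    answer_from_table: TextQA,
) -> str:
    """Answer a multiple-choice question through an intermediate data table."""
    # Step 1: recover the underlying data, discarding the visual encoding.
    table = extract_table(chart_image)
    # Step 2: answer from the table only; the chart image is not passed on.
    answer = answer_from_table(table, question, options)
    if answer not in options:
        raise ValueError(f"model returned an answer outside the options: {answer!r}")
    return answer


def random_baseline(option_counts: Sequence[int]) -> float:
    """Expected accuracy of uniform random guessing over multiple-choice questions.

    option_counts[i] is the number of answer options for question i.
    """
    if not option_counts:
        raise ValueError("need at least one question")
    return sum(1.0 / n for n in option_counts) / len(option_counts)
```

Passing the two model calls in as functions keeps the sketch independent of any particular model API, and `random_baseline` gives the reference level against which an accuracy drop on misleading charts would be judged.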
