Format Matters: The Robustness of Multimodal LLMs in Reviewing Evidence from Tables and Charts

Xanh Ho; Yun-Ang Wu; Sunisth Kumar; Florian Boudin; Atsuhiro Takasu; Akiko Aizawa

arXiv:2511.10075·cs.CL·November 14, 2025

TL;DR

This paper evaluates the robustness of multimodal large language models in verifying scientific claims from tables and charts, revealing strengths with tables but significant challenges with charts, and highlighting the need for improved multimodal reasoning.

Contribution

The study adapts two existing datasets of scientific papers for multimodal claim verification and systematically evaluates 12 multimodal LLMs, uncovering gaps in chart understanding and cross-modal generalization.

Findings

01. Models perform better with table evidence than with chart evidence.

02. Models struggle with chart evidence.

03. Humans maintain high performance across both formats.

Abstract

With the growing number of submitted scientific papers, there is an increasing demand for systems that can assist reviewers in evaluating research claims. Experimental results are a core component of scientific work, often presented in varying formats such as tables or charts. Understanding how robust current multimodal large language models (multimodal LLMs) are at verifying scientific claims across different evidence formats remains an important and underexplored challenge. In this paper, we design and conduct a series of experiments to assess the ability of multimodal LLMs to verify scientific claims using both tables and charts as evidence. To enable this evaluation, we adapt two existing datasets of scientific papers by incorporating annotations and structures necessary for a multimodal claim verification task. Using this adapted dataset, we evaluate 12 multimodal LLMs and find…

Taxonomy

Topics: Topic Modeling · Machine Learning in Materials Science · Advanced Text Analysis Techniques