GRAFT: GRaPH and Table Reasoning for Textual Alignment -- A Benchmark for Structured Instruction Following and Visual Reasoning

Abhigya Verma; Sriram Puttagunta; Seganrasan Subramanian; Sravan Ramachandran

arXiv:2508.15690·cs.AI·December 3, 2025

TL;DR

GRAFT is a comprehensive benchmark for evaluating multimodal large language models on structured visual reasoning and instruction following using programmatically generated charts and tables with multi-step analytical questions.

Contribution

It introduces a new structured multimodal benchmark with a taxonomy of reasoning operations, enabling detailed assessment of models' visual and textual reasoning abilities.

Findings

01 Provides a scalable framework for multimodal reasoning evaluation

02 Supports fine-grained analysis of reasoning processes

03 Establishes a standard for future multimodal benchmarks

Abstract

GRAFT is a structured multimodal benchmark designed to probe how well LLMs handle instruction following, visual reasoning, and tasks requiring tight visual-textual alignment. The dataset is built around programmatically generated charts and synthetically rendered tables, each paired with a carefully constructed, multi-step analytical question that depends solely on what can be inferred from the image itself. Responses are formatted as structured outputs such as JSON or YAML, enabling consistent and fine-grained evaluation of both reasoning processes and adherence to output specifications. The benchmark further introduces a taxonomy of reasoning operations, ranging from comparison and trend identification to ranking, aggregation, proportional estimation, and anomaly detection, to support a comprehensive assessment of model capabilities. Taken together, GRAFT provides a unified and scalable…
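
To make the idea of checking "adherence to output specifications" concrete, here is a minimal Python sketch of a format check on a JSON response. It is not the benchmark's actual evaluation code: the field names `answer` and `reasoning_type`, the function `check_response`, and the string labels in `ReasoningOp` are all assumptions made for illustration. Only the six operation types themselves come from the abstract.

```python
import json
from enum import Enum


class ReasoningOp(Enum):
    # The six operation types named in the abstract; the string labels are assumed.
    COMPARISON = "comparison"
    TREND_IDENTIFICATION = "trend_identification"
    RANKING = "ranking"
    AGGREGATION = "aggregation"
    PROPORTIONAL_ESTIMATION = "proportional_estimation"
    ANOMALY_DETECTION = "anomaly_detection"


# Hypothetical output specification: required keys and their expected types.
REQUIRED_FIELDS = {"answer": str, "reasoning_type": str}


def check_response(raw: str) -> list[str]:
    """Return a list of format violations; an empty list means the response conforms."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(parsed, dict):
        return ["top-level value is not an object"]

    problems = []
    for key, expected in REQUIRED_FIELDS.items():
        if key not in parsed:
            problems.append(f"missing field: {key}")
        elif not isinstance(parsed[key], expected):
            problems.append(f"wrong type for field: {key}")

    # Flag labels that fall outside the taxonomy.
    valid_ops = {op.value for op in ReasoningOp}
    op = parsed.get("reasoning_type")
    if isinstance(op, str) and op not in valid_ops:
        problems.append(f"unknown reasoning_type: {op}")
    return problems
```

Returning a list of violations rather than a single pass/fail flag keeps format adherence separable from answer correctness, which is the kind of fine-grained scoring the abstract describes.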

Code & Models

Datasets

ServiceNow-AI/GRAFT_benchmark
dataset · 155 downloads