PlotChain: Deterministic Checkpointed Evaluation of Multimodal LLMs on Engineering Plot Reading

Mayank Ravishankara

arXiv:2602.13232·cs.AI·April 23, 2026

TL;DR

PlotChain introduces a deterministic benchmark for evaluating multimodal LLMs on engineering plot reading, emphasizing sub-skill diagnostics and reproducibility.

Contribution

The paper presents a generator-based, checkpointed evaluation protocol that provides exact ground truth and failure localization for multimodal models reading engineering plots.

Findings

01 Top models achieve over 78% field-level pass rate.

02 Frequency-domain tasks remain challenging for current models.

03 The benchmark and evaluation tools are publicly released for reproducibility.

Abstract

We present PlotChain, a deterministic, generator-based benchmark for evaluating multimodal large language models (MLLMs) on engineering plot reading: recovering quantitative values from classic plots (e.g., Bode/FFT, step response, stress-strain, pump curves) rather than OCR-only extraction or free-form captioning. PlotChain contains 15 plot families with 450 rendered plots (30 per family), where every item is produced from known parameters and paired with exact ground truth computed directly from the generating process. A central contribution is checkpoint-based diagnostic evaluation: in addition to final targets, each item includes intermediate 'cp_' fields that isolate sub-skills (e.g., reading cutoff frequency or peak magnitude) and enable failure localization within a plot family. We evaluate four state-of-the-art MLLMs under a standardized, deterministic protocol (temperature = 0…
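
To make the generator-plus-checkpoint idea concrete, here is a minimal sketch of how one item might be produced and scored. It is not the PlotChain code: the plot family (a first-order low-pass Bode magnitude plot), the field names other than the 'cp_' prefix, the parameter choices, and the tolerance values are all assumptions chosen for illustration, and plot rendering is left out.

```python
import math
import random


def make_lowpass_item(seed: int) -> dict:
    """Build one hypothetical Bode-magnitude item from a seed (no rendering)."""
    rng = random.Random(seed)  # seeded RNG, so the same seed gives the same item
    fc_hz = rng.choice([50.0, 100.0, 200.0, 500.0, 1000.0])
    dc_gain_db = rng.choice([0.0, 6.0, 12.0, 20.0])
    probe_hz = 10.0 * fc_hz  # ask about the gain one decade above cutoff
    # First-order low-pass magnitude: G0 - 10*log10(1 + (f/fc)^2), in dB.
    gain_at_probe_db = dc_gain_db - 10.0 * math.log10(1.0 + (probe_hz / fc_hz) ** 2)
    return {
        "params": {"fc_hz": fc_hz, "dc_gain_db": dc_gain_db, "probe_hz": probe_hz},
        "truth": {
            "cp_dc_gain_db": dc_gain_db,           # checkpoint: read the flat region
            "cp_cutoff_hz": fc_hz,                 # checkpoint: read the corner
            "gain_at_probe_db": gain_at_probe_db,  # final target
        },
    }


def field_passes(pred, truth, rel_tol=0.05, abs_tol=0.5) -> bool:
    """Assumed tolerance rule: within 5% of truth or 0.5 units, whichever is larger."""
    if pred is None:
        return False
    return abs(pred - truth) <= max(abs_tol, rel_tol * abs(truth))


def score_item(pred: dict, truth: dict) -> dict:
    """Return a pass/fail flag per field; missing predictions count as failures."""
    return {name: field_passes(pred.get(name), value) for name, value in truth.items()}


def field_pass_rate(results: list) -> float:
    """Fraction of all scored fields that passed, pooled across items."""
    flags = [ok for result in results for ok in result.values()]
    return sum(flags) / len(flags) if flags else 0.0


if __name__ == "__main__":
    item = make_lowpass_item(seed=0)
    # A made-up model answer: checkpoints read correctly, final value off by 3 dB.
    answer = dict(item["truth"])
    answer["gain_at_probe_db"] += 3.0
    flags = score_item(answer, item["truth"])
    print(flags, field_pass_rate([flags]))
```

In this sketch, a response that passes both 'cp_' fields but fails the final target points to an error after the basic readings (for example, in applying the roll-off), whereas a failed 'cp_cutoff_hz' would point to misreading the plot itself. That is the kind of failure localization the abstract describes, under the assumed schema above.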
