PlotChain: Deterministic Checkpointed Evaluation of Multimodal LLMs on Engineering Plot Reading
Mayank Ravishankara

TL;DR
PlotChain introduces a deterministic benchmark for evaluating multimodal LLMs on engineering plot reading, emphasizing sub-skill diagnostics and reproducibility.
Contribution
It presents a new generator-based, checkpointed evaluation protocol with exact ground truth and failure localization for multimodal models on engineering plots.
Findings
Top models achieve a field-level pass rate above 78%.
Frequency-domain tasks remain challenging for current models.
The benchmark and evaluation tools are publicly released for reproducibility.
Abstract
We present PlotChain, a deterministic, generator-based benchmark for evaluating multimodal large language models (MLLMs) on engineering plot reading: recovering quantitative values from classic plots (e.g., Bode/FFT, step response, stress-strain, pump curves) rather than OCR-only extraction or free-form captioning. PlotChain contains 15 plot families with 450 rendered plots (30 per family), where every item is produced from known parameters and paired with exact ground truth computed directly from the generating process. A central contribution is checkpoint-based diagnostic evaluation: in addition to final targets, each item includes intermediate 'cp_' fields that isolate sub-skills (e.g., reading cutoff frequency or peak magnitude) and enable failure localization within a plot family. We evaluate four state-of-the-art MLLMs under a standardized, deterministic protocol (temperature = 0…
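The checkpoint idea described in the abstract can be illustrated with a minimal Python sketch. It is not the PlotChain scorer: the field names (`cp_peak_magnitude_db`, `cp_cutoff_hz`, `bandwidth_hz`), the 5% relative tolerance, and the helper names `score_item` and `summarize` are all assumptions made for illustration. Only the 'cp_' prefix for intermediate fields and the notion of a field-level pass rate come from the text above.

```python
import math

def score_item(truth, pred, rel_tol=0.05):
    """Return {field: bool} pass/fail for each ground-truth field.

    rel_tol is an assumed tolerance, not the benchmark's actual setting.
    A missing or non-numeric prediction counts as a failure.
    """
    results = {}
    for name, expected in truth.items():
        got = pred.get(name)
        results[name] = (
            isinstance(got, (int, float))
            and math.isclose(got, expected, rel_tol=rel_tol)
        )
    return results

def summarize(results):
    """Split failures into checkpoint ('cp_') and final fields."""
    checkpoints = {k: v for k, v in results.items() if k.startswith("cp_")}
    finals = {k: v for k, v in results.items() if not k.startswith("cp_")}
    return {
        "field_pass_rate": sum(results.values()) / len(results),
        "failed_checkpoints": sorted(k for k, v in checkpoints.items() if not v),
        "failed_finals": sorted(k for k, v in finals.items() if not v),
    }

# Hypothetical item: the model reads the peak correctly but misreads the cutoff,
# so the final answer that depends on the cutoff is also wrong.
truth = {"cp_peak_magnitude_db": 12.0, "cp_cutoff_hz": 100.0, "bandwidth_hz": 100.0}
pred = {"cp_peak_magnitude_db": 12.1, "cp_cutoff_hz": 150.0, "bandwidth_hz": 148.0}
report = summarize(score_item(truth, pred))
```

In this hypothetical item, the failed checkpoint list points at the cutoff reading as the sub-skill that broke, which is the kind of failure localization the 'cp_' fields are meant to support.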
