QCalEval: Benchmarking Vision-Language Models for Quantum Calibration Plot Understanding

Shuxiang Cao; Zijian Zhang; Abhishek Agarwal; Grace Bratrud; Niyaz R. Beysengulov; Daniel C. Cole; Alejandro Gómez Frieiro; Elena O. Glen; Hao Hsu; Gang Huang; Raymond Jow; Greshma Shaji; Tom Lubowe; Ligeng Zhu; Luis Mantilla Calderón; Nicola Pancotti; Joel Pendleton; Brandon Severin; Charles Etienne Staub; Sara Sussman; Antti Vepsäläinen; Neel Rajeshbhai Vora; Yilun Xu; Varinia Bernales; Daniel Bowring; Elica Kyoseva; Ivan Rungger; Giulia Semeghini; Sam Stanwyck; Timothy Costa; Alán Aspuru-Guzik; Krysta Svore

arXiv:2604.25884·quant-ph·April 29, 2026

TL;DR

This paper introduces QCalEval, a benchmark that tests how well vision-language models interpret quantum calibration plots, and uses it to show where current models succeed and where they fall short in this specialized domain.

Contribution

The paper presents the first systematic evaluation of vision-language models (VLMs) on quantum calibration plots: a new benchmark, plus an analysis of model performance in zero-shot and in-context learning settings.

Findings

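1. The best general-purpose zero-shot model reaches a mean score of 72.3.

2. Many open-weight models degrade under multi-image in-context learning, whereas frontier closed models improve substantially.

3. Supervised fine-tuning at the 9-billion-parameter scale improves zero-shot performance but not in-context learning.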

Abstract

Quantum computing calibration depends on interpreting experimental data, and calibration plots provide the most universal human-readable representation for this task, yet no systematic evaluation exists of how well vision-language models (VLMs) interpret them. We introduce QCalEval, the first VLM benchmark for quantum calibration plots: 243 samples across 87 scenario types from 22 experiment families, spanning superconducting qubits and neutral atoms, evaluated on six question types in both zero-shot and in-context learning settings. The best general-purpose zero-shot model reaches a mean score of 72.3, and many open-weight models degrade under multi-image in-context learning, whereas frontier closed models improve substantially. A supervised fine-tuning ablation at the 9-billion-parameter scale shows that SFT improves zero-shot performance but cannot close the multimodal in-context…
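The comparison between zero-shot and in-context settings can be illustrated with a minimal Python sketch. This is not the paper's scoring code: the abstract does not say how the mean score is aggregated, so the equal-weight average over question types used here is an assumption, and the function names, question-type labels, and numbers are all hypothetical placeholders rather than benchmark results.

```python
from statistics import mean


def macro_average(scores_by_question_type):
    """Average the per-question-type means so each type counts equally.

    ASSUMPTION: equal weighting per question type; the paper may aggregate
    its mean score differently.
    """
    return mean(mean(scores) for scores in scores_by_question_type.values())


def icl_delta(zero_shot, in_context):
    """Return in-context mean minus zero-shot mean.

    A negative value means the model got worse with in-context examples.
    """
    return macro_average(in_context) - macro_average(zero_shot)


if __name__ == "__main__":
    # Hypothetical per-sample scores for one model, keyed by made-up
    # question-type labels. These are NOT results from the paper.
    zero_shot = {"type_a": [60.0, 80.0], "type_b": [50.0]}
    in_context = {"type_a": [50.0, 70.0], "type_b": [40.0]}

    print("zero-shot mean:", macro_average(zero_shot))
    print("in-context delta:", icl_delta(zero_shot, in_context))
```

Computing the delta per model makes the direction of the in-context effect explicit, which is the quantity the abstract contrasts between open-weight and frontier closed models.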
