Calibrated Confidence Estimation for Tabular Question Answering

Lukas Voss

arXiv:2604.12491 · cs.CL · April 15, 2026

TL;DR

This paper systematically compares confidence estimation methods for large language models (LLMs) in tabular question answering and introduces Multi-Format Agreement (MFA), which improves calibration at reduced cost.

Contribution

The paper presents the first comprehensive comparison of confidence estimation methods for LLMs on tabular data and proposes MFA, a cost-effective calibration technique.
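
This summary does not spell out how MFA is computed. As a minimal sketch, assuming it means asking the same question over several serializations of one table and treating the share of matching answers as the confidence score, the agreement step might look like the following. The format names, the `answers_by_format` mapping, and both function names are hypothetical, and no model call is made.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(answer.strip().lower().split())


def format_agreement(answers_by_format: dict[str, str]) -> tuple[str, float]:
    """Return the majority answer and the fraction of formats that agree with it.

    Assumption: one answer per table serialization (hypothetical format names).
    """
    if not answers_by_format:
        raise ValueError("need at least one answer")
    counts = Counter(normalize(a) for a in answers_by_format.values())
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(answers_by_format)


# Hypothetical answers produced from four serializations of the same table.
answers = {"markdown": "Paris", "csv": "paris ", "json": "Paris", "html": "Lyon"}
answer, confidence = format_agreement(answers)
```

Exact string matching after normalization is the simplest possible agreement rule; numeric answers or paraphrases would likely need a looser comparison.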

Findings

01. All models are overconfident in tabular QA.

02. MFA reduces ECE by 44-63%.

03. MFA + self-consistency ensemble improves AUROC from 0.74 to 0.82.
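
The two metrics named in these findings can be illustrated with a short sketch. It uses the standard equal-width binned ECE, not the smooth ECE variant the abstract reports, and a pairwise-comparison AUROC. The function names and the sample data are illustrative assumptions, not values from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |mean confidence - accuracy| over bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        bins[idx].append((conf, is_correct))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


def auroc(confidences, correct):
    """Probability that a correct answer outranks an incorrect one (ties = 0.5)."""
    pos = [c for c, y in zip(confidences, correct) if y]
    neg = [c for c, y in zip(confidences, correct) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: high stated confidence, only half the answers correct.
confs = [0.95, 0.92, 0.85, 0.82]
labels = [1, 0, 1, 0]
ece = expected_calibration_error(confs, labels)
auc = auroc(confs, labels)
```

ECE measures how far stated confidence sits from observed accuracy, while AUROC measures only whether correct answers rank above incorrect ones, so a method can do well on one and poorly on the other.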

Abstract

Large language models (LLMs) are increasingly deployed for tabular question answering, yet calibration on structured data is largely unstudied. This paper presents the first systematic comparison of five confidence estimation methods across five frontier LLMs and two tabular QA benchmarks. All models are severely overconfident (smooth ECE 0.35-0.64 versus 0.10-0.15 reported for textual QA). A consistent self-evaluation versus perturbation dichotomy replicates across both benchmarks and all four fully covered models: self-evaluation methods (verbalized, P(True)) achieve AUROC 0.42-0.76, while perturbation methods (semantic entropy, self-consistency, and our Multi-Format Agreement) achieve AUROC 0.78-0.86. Per-model paired bootstrap tests reject the null at p<0.001 after Holm-Bonferroni correction, and a 3-seed check on GPT-4o-mini gives a per-seed standard deviation…
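
The Holm-Bonferroni correction mentioned in the abstract is a standard step-down procedure, sketched below. The p-values are made up for illustration and the paired bootstrap that would produce them is not shown; the function name is an assumption.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return one reject/keep flag per hypothesis, in the input order.

    The k-th smallest p-value (k = 0, 1, ...) is compared with alpha / (m - k);
    testing stops at the first p-value that fails its threshold.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # every larger p-value is also retained
    return reject


# Made-up p-values, e.g. one test per model.
p_values = [0.0004, 0.2, 0.0002, 0.03]
decisions = holm_bonferroni(p_values)
```

With these inputs the two smallest p-values pass their thresholds, 0.03 fails against 0.05 / 2 = 0.025, and the procedure stops there, so the last two hypotheses are retained.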
