Confidence Scoring for LLM-Generated SQL in Supply Chain Data Extraction
Jiekai Ma, Yikai Zhao

TL;DR
This paper evaluates methods to estimate confidence in LLM-generated SQL queries for supply chain data, highlighting the limitations of self-reported confidence and the effectiveness of embedding-based similarity checks.
Contribution
It introduces and compares three approaches for confidence scoring in LLM-generated SQL, emphasizing the potential of embedding-based methods for accuracy assessment.
Findings
Embedding-based similarity effectively detects inaccurate SQL queries.
Self-reported confidence scores are unreliable because LLMs are often overconfident in their own outputs.
Translation-based consistency checks show moderate effectiveness.
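One simple way to make "overconfident" concrete is to compare the average self-reported confidence with the fraction of queries that were actually correct. The sketch below is illustrative only and is not the paper's evaluation procedure; the function name `overconfidence_gap` and its inputs are hypothetical.

```python
def overconfidence_gap(confidences, correct):
    """Return mean self-reported confidence minus observed accuracy.

    confidences: floats in [0, 1], one per generated SQL query
                 (hypothetical self-reported scores).
    correct:     booleans, True if the matching query was judged correct.
    A positive result means the model claimed more confidence than its
    accuracy supports.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and the same length")
    mean_confidence = sum(confidences) / len(confidences)
    accuracy = sum(1 for ok in correct if ok) / len(correct)
    return mean_confidence - accuracy
```

A gap near zero suggests the scores are roughly calibrated on average; it says nothing about whether individual scores separate correct from incorrect queries.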
Abstract
Large Language Models (LLMs) have recently enabled natural language interfaces that translate user queries into executable SQL, offering a powerful solution for non-technical stakeholders to access structured data. However, because LLMs do not natively express uncertainty, it is difficult to assess the reliability of the queries they generate. This paper presents a case study that evaluates multiple approaches to estimating confidence scores for LLM-generated SQL in supply chain data retrieval. We investigated three strategies: (1) translation-based consistency checks; (2) embedding-based semantic similarity between user questions and generated SQL; and (3) self-reported confidence scores directly produced by the LLM. Our findings reveal that LLMs are often overconfident in their own outputs, which limits the effectiveness of self-reported confidence. In contrast,…
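Strategy (2) can be sketched as a cosine similarity between an embedding of the user question and an embedding of the generated SQL. The sketch below is a minimal illustration, not the authors' implementation: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, and the table and column names in the usage note are invented.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: lowercase token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as token->count maps."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction, so treat it as dissimilar
    return dot / (norm_a * norm_b)


def similarity_confidence(question, sql):
    """Score in [0, 1]: higher means the SQL looks closer to the question."""
    return cosine_similarity(toy_embed(question), toy_embed(sql))
```

For example, `similarity_confidence("total orders by supplier", "SELECT supplier, COUNT(*) FROM orders GROUP BY supplier")` scores higher than the same question paired with `"SELECT name FROM warehouses"`, because the first query shares tokens with the question and the second shares none. A learned embedding model would capture semantic rather than purely lexical overlap.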
