Measuring What LLMs Think They Do: SHAP Faithfulness and Deployability on Financial Tabular Classification

Saeed AlMarri; Mathieu Ravaut; Kristof Juhasz; Gautier Marti; Hamdan Al Ahbabi; Ibrahim Elfadel

arXiv:2512.00163·cs.LG·December 2, 2025

TL;DR

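This paper computes SHAP values for LLMs used as zero-shot classifiers on financial tabular data and compares them with the models' own self-explanations and with LightGBM's SHAP values. Both comparisons reveal divergences, which limits LLMs as standalone classifiers but leaves room for improved explainability in high-stakes domains.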

Contribution

The paper systematically computes SHAP values for LLMs on financial classification tasks, shows that they diverge both from the LLMs' self-explanations and from LightGBM's SHAP values, and discusses the implications for deployment.

Findings

01 LLMs' SHAP values diverge from their self-explanations of feature impact.

02 LLM SHAP values differ notably from LightGBM SHAP values.

03 LLMs have clear limitations as standalone classifiers for structured financial modeling.

Abstract

Large Language Models (LLMs) have attracted significant attention for classification tasks, offering a flexible alternative to trusted classical machine learning models like LightGBM through zero-shot prompting. However, their reliability for structured tabular data remains unclear, particularly in high-stakes applications like financial risk assessment. Our study systematically evaluates LLMs and generates their SHAP values on financial classification tasks. Our analysis shows a divergence between LLMs' self-explanations of feature impact and their SHAP values, as well as notable differences between LLM and LightGBM SHAP values. These findings highlight the limitations of LLMs as standalone classifiers for structured financial modeling, but also instill optimism that improved explainability mechanisms coupled with few-shot prompting will make LLMs usable in risk-sensitive domains.
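To make the comparison in the abstract concrete, the following is a minimal sketch of how attribution values for a black-box classifier might be set against the importances the model reports about itself. It is not the authors' pipeline: the abstract does not state which SHAP estimator or agreement metric the paper uses, so the exact Shapley computation, the Spearman rank correlation, the `toy_predict` stand-in for an LLM call, the feature names, and the self-reported scores are all assumptions chosen for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x.

    Features outside a coalition are replaced by their baseline value.
    Cost grows as 2^n, so this is only practical for a handful of features.
    """
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude feature i
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                present = set(subset)
                without_i = [x[j] if j in present else baseline[j] for j in range(n)]
                with_i = list(without_i)
                with_i[i] = x[i]
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi


def ranks(values):
    """Rank positions with 1 = largest value; assumes no ties."""
    order = sorted(range(len(values)), key=lambda j: -values[j])
    result = [0] * len(values)
    for position, j in enumerate(order, start=1):
        result[j] = position
    return result


def spearman(a, b):
    """Spearman rank correlation for tie-free score lists of equal length."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((p - q) ** 2 for p, q in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))


def toy_predict(features):
    """Hypothetical stand-in for an LLM's predicted default probability."""
    income, debt, age = features
    return 0.5 + 0.3 * debt - 0.2 * income + 0.0 * age


x = [1.0, 1.0, 1.0]          # instance to explain: income, debt, age (hypothetical)
baseline = [0.0, 0.0, 0.0]   # reference input (assumed)
phi = shapley_values(toy_predict, x, baseline)

# Hypothetical importances the model claims for itself when asked.
self_reported = [0.6, 0.3, 0.1]
rho = spearman([abs(v) for v in phi], self_reported)
print(phi, rho)
```

In this toy setup the attributions rank debt above income, while the made-up self-report ranks income first, so the rank correlation falls below 1. With a real LLM, `toy_predict` would be replaced by a function that serializes the row into a prompt and parses the returned probability, and a sampling-based SHAP estimator would likely be needed because exact enumeration scales exponentially in the number of features.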

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Financial Distress and Bankruptcy Prediction · Artificial Intelligence in Healthcare and Education