Evaluating LLM Understanding via Structured Tabular Decision Simulations
Sichao Li, Xinyue Xu, Xiaomeng Li

TL;DR
This paper introduces Structured Tabular Decision Simulations (STaDS), a new evaluation framework for assessing whether large language models truly understand decision factors across diverse domains, beyond mere accuracy.
Contribution
The paper presents STaDS, a novel suite of decision-based evaluation settings that measure LLM understanding through decision factor reliance and comprehension across multiple domains.
Findings
Most models struggle with consistent accuracy across domains
Models can be accurate but rely on incorrect decision factors
Stated rationales frequently do not match the factors that actually drive decisions
Abstract
Large language models (LLMs) often achieve impressive predictive accuracy, yet correctness alone does not imply genuine understanding. True LLM understanding, analogous to human expertise, requires making consistent, well-founded decisions across multiple instances and diverse domains, relying on relevant and domain-grounded decision factors. We introduce Structured Tabular Decision Simulations (STaDS), a suite of expert-like decision settings that evaluate LLMs as if they were professionals undertaking structured decision "exams". In this context, understanding is defined as the ability to identify and rely on the correct decision factors, features that determine outcomes within a domain. STaDS jointly assesses understanding through: (i) question and instruction comprehension, (ii) knowledge-based prediction, and (iii) reliance on relevant decision factors. By analyzing 9 frontier…
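As a rough illustration of how predictive accuracy and decision-factor reliance can come apart, the sketch below scores a toy decision function on both. It is not the STaDS protocol or its metrics: the `toy_model` stand-in for an LLM, the feature names, the swap-based reliance score, the hand-written `stated_rationale`, and the Jaccard comparison are all assumptions made for this example.

```python
from typing import Callable, Dict, Sequence, Set

Row = Dict[str, float]
Model = Callable[[Row], int]


def toy_model(row: Row) -> int:
    # Hypothetical stand-in for an LLM decision: it looks only at zip_code,
    # a factor assumed here to be irrelevant to the true outcome.
    return int(row["zip_code"] >= 5)


def accuracy(model: Model, rows: Sequence[Row], labels: Sequence[int]) -> float:
    # Fraction of rows where the model's decision matches the label.
    correct = sum(model(r) == y for r, y in zip(rows, labels))
    return correct / len(rows)


def swap_reliance(
    model: Model, rows: Sequence[Row], features: Sequence[str]
) -> Dict[str, float]:
    # For each feature, replace its value with the value from every other row
    # and record how often the decision flips. This is an assumed, simple
    # reliance score, not the measure used in the paper.
    scores: Dict[str, float] = {}
    for feat in features:
        flips = 0
        trials = 0
        for i, row in enumerate(rows):
            base = model(row)
            for j, other in enumerate(rows):
                if i == j:
                    continue
                perturbed = dict(row)
                perturbed[feat] = other[feat]
                flips += int(model(perturbed) != base)
                trials += 1
        scores[feat] = flips / trials if trials else 0.0
    return scores


def jaccard(a: Set[str], b: Set[str]) -> float:
    # Set overlap in [0, 1]; two empty sets count as identical.
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Illustrative data: zip_code happens to correlate with the label.
rows = [
    {"income": 80, "debt": 10, "zip_code": 9},
    {"income": 75, "debt": 20, "zip_code": 8},
    {"income": 30, "debt": 40, "zip_code": 2},
    {"income": 25, "debt": 35, "zip_code": 1},
]
labels = [1, 1, 0, 0]
features = ["income", "debt", "zip_code"]

expected_factors = {"income", "debt"}  # assumed domain-relevant factors
stated_rationale = {"income"}          # assumed self-reported explanation

acc = accuracy(toy_model, rows, labels)
scores = swap_reliance(toy_model, rows, features)
relied = {f for f, s in scores.items() if s > 0}
factor_overlap = jaccard(relied, expected_factors)
rationale_overlap = jaccard(stated_rationale, relied)

print(acc, scores, factor_overlap, rationale_overlap)
```

In this toy setup the decision function is fully accurate, yet the factors it relies on share nothing with the expected factors or with its stated rationale, which mirrors the kind of mismatch the findings describe.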
Taxonomy
Topics: Text Readability and Simplification · Artificial Intelligence in Healthcare and Education · Computational and Text Analysis Methods
