When Stability Fails: Hidden Failure Modes of LLMs in Data-Constrained Scientific Decision-Making
Nazia Riasat

TL;DR
This paper reveals that large language models can appear stable across runs but still produce incorrect or misleading results in scientific decision-making tasks, emphasizing the need for explicit validation.
Contribution
It introduces a framework to evaluate LLM decision-making across stability, correctness, prompt sensitivity, and output validity, highlighting limitations of stability as a sole metric.
Findings
LLMs can be stable yet diverge from ground truth
Minor prompt changes can significantly alter outputs
LLMs may produce plausible but incorrect identifiers
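
The distinction between the four dimensions (stability, correctness, prompt sensitivity, output validity) can be made concrete with a small sketch. This is an illustration only, not the paper's implementation: the use of Jaccard overlap as the agreement measure, the function names, and the gene identifiers (`G1`, `G9`, and so on) are all assumptions introduced here. It shows how a model can score perfectly on stability while scoring poorly on the other three dimensions.

```python
from itertools import combinations


def jaccard(a, b):
    """Set overlap between two gene lists (assumed agreement measure)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def evaluate(runs_a, runs_b, reference, valid_ids):
    """Score repeated LLM outputs on four separate dimensions.

    runs_a    -- gene lists from repeated runs under prompt A
    runs_b    -- gene lists from repeated runs under a reworded prompt B
    reference -- gene list from the statistical ground truth
    valid_ids -- identifiers that actually exist in the input data
    """
    # Stability: mean pairwise agreement among runs of the same prompt.
    pairs = list(combinations(runs_a, 2))
    stability = sum(jaccard(x, y) for x, y in pairs) / len(pairs) if pairs else 1.0

    # Correctness: mean agreement of each run with the reference.
    correctness = sum(jaccard(r, reference) for r in runs_a) / len(runs_a)

    # Prompt sensitivity: 1 minus mean agreement across the two prompts.
    cross = [jaccard(x, y) for x in runs_a for y in runs_b]
    sensitivity = 1.0 - sum(cross) / len(cross)

    # Output validity: share of returned identifiers present in the input.
    returned = [g for r in runs_a for g in r]
    validity = sum(g in valid_ids for g in returned) / len(returned) if returned else 1.0

    return {
        "stability": stability,
        "correctness": correctness,
        "sensitivity": sensitivity,
        "validity": validity,
    }


# Hypothetical data: three identical runs per prompt, one invented gene ("G9").
runs_a = [["G1", "G2", "G9"]] * 3
runs_b = [["G1", "G4", "G9"]] * 3
reference = ["G1", "G2", "G3"]
valid_ids = {"G1", "G2", "G3", "G4"}

result = evaluate(runs_a, runs_b, reference, valid_ids)
print(result)
```

In this toy case the runs under prompt A are identical, so stability is 1.0, yet agreement with the reference is only 0.5, the reworded prompt changes half of the overlap, and one in three returned identifiers does not exist in the input. Reporting the four scores separately is what exposes the gap that a stability-only check would hide.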
Abstract
Large language models (LLMs) are increasingly used as decision-support tools in data-constrained scientific workflows, where correctness and validity are critical. However, evaluation practices often emphasize stability or reproducibility across repeated runs. While these properties are desirable, stability alone does not guarantee agreement with statistical ground truth when such references are available. We introduce a controlled behavioral evaluation framework that explicitly separates four dimensions of LLM decision-making: stability, correctness, prompt sensitivity, and output validity under fixed statistical inputs. We evaluate multiple LLMs using a statistical gene prioritization task derived from differential expression analysis across prompt regimes involving strict and relaxed significance thresholds, borderline ranking scenarios, and minor wording variations. Our ex-…
Taxonomy
Topics: Scientific Computing and Data Management · Machine Learning in Materials Science · Explainable Artificial Intelligence (XAI)
