When Stability Fails: Hidden Failure Modes of LLMs in Data-Constrained Scientific Decision-Making

Nazia Riasat

arXiv:2603.15840 · cs.LG · March 18, 2026

Open Access

TL;DR

This paper reveals that large language models can appear stable across runs but still produce incorrect or misleading results in scientific decision-making tasks, emphasizing the need for explicit validation.

Contribution

It introduces a framework to evaluate LLM decision-making across stability, correctness, prompt sensitivity, and output validity, highlighting limitations of stability as a sole metric.

Findings

01. LLMs can be stable yet diverge from ground truth

02. Minor prompt changes can significantly alter outputs

03. LLMs may produce plausible but incorrect identifiers
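The three findings correspond to quantities that can be measured separately. The sketch below is illustrative only and is not the paper's implementation: the use of Jaccard overlap as the agreement score, the function names, and the gene identifiers (`GENE_A`, `GENE_X`, and so on) are all assumptions made for the example. It shows how a model can score perfectly on stability while scoring poorly on correctness and validity.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap between two collections of identifiers (assumed metric)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def stability(runs):
    """Mean pairwise overlap across repeated runs of the same prompt."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def correctness(runs, truth):
    """Mean overlap between each run and the statistical ground truth."""
    return sum(jaccard(r, truth) for r in runs) / len(runs)


def validity(runs, candidates):
    """Fraction of returned identifiers that actually appear in the input."""
    returned = [g for r in runs for g in r]
    if not returned:
        return 1.0
    return sum(g in candidates for g in returned) / len(returned)


def prompt_sensitivity(runs_a, runs_b):
    """One minus the mean overlap between outputs under two prompt wordings."""
    scores = [jaccard(a, b) for a in runs_a for b in runs_b]
    return 1.0 - sum(scores) / len(scores)


# Hypothetical data: four candidate genes, two of which are truly significant.
candidates = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}
truth = {"GENE_A", "GENE_B"}

# Three identical runs that include an identifier absent from the input.
runs = [["GENE_A", "GENE_X"]] * 3
# Three runs under a slightly reworded prompt.
reworded_runs = [["GENE_A", "GENE_B"]] * 3

print("stability:", stability(runs))
print("correctness:", correctness(runs, truth))
print("validity:", validity(runs, candidates))
print("prompt sensitivity:", prompt_sensitivity(runs, reworded_runs))
```

In this toy case the runs agree with each other exactly, so stability is 1.0, yet only one of the two true genes is recovered and half of the returned identifiers do not exist in the input. Reporting stability alone would hide both problems.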

Abstract

Large language models (LLMs) are increasingly used as decision-support tools in data-constrained scientific workflows, where correctness and validity are critical. However, evaluation practices often emphasize stability or reproducibility across repeated runs. While these properties are desirable, stability alone does not guarantee agreement with statistical ground truth when such references are available. We introduce a controlled behavioral evaluation framework that explicitly separates four dimensions of LLM decision-making: stability, correctness, prompt sensitivity, and output validity under fixed statistical inputs. We evaluate multiple LLMs using a statistical gene prioritization task derived from differential expression analysis across prompt regimes involving strict and relaxed significance thresholds, borderline ranking scenarios, and minor wording variations. Our ex-…

Taxonomy

Topics: Scientific Computing and Data Management · Machine Learning in Materials Science · Explainable Artificial Intelligence (XAI)