The Knowledge-Reasoning Dissociation: Fundamental Limitations of LLMs in Clinical Natural Language Inference
Maël Jullien, Marco Valentino, and André Freitas

TL;DR
This paper introduces a benchmark for evaluating LLM reasoning in clinical trial natural language inference. It shows that models often possess the relevant knowledge but lack the structured internal representations needed for reliable inference.
Contribution
It presents a novel Clinical Trial Natural Language Inference benchmark in which each item is paired with a targeted probe (GKMRV) that dissociates failures of factual access from failures of inference.
Findings
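LLMs perform well on knowledge verification (mean GKMRV accuracy 0.918) but poorly on the main reasoning tasks (mean accuracy 0.25).
Inferences are highly consistent across samples (mean 0.87) despite low accuracy, and often rely on heuristics and shortcuts.
Current LLMs lack the structured, composable representations needed for reliable reasoning.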
Abstract
Large language models are often assumed to acquire increasingly structured, generalizable internal representations simply by scaling data and parameters. We interrogate this assumption by introducing a Clinical Trial Natural Language Inference benchmark comprising four reasoning families: Causal Attribution, Compositional Grounding, Epistemic Verification, and Risk State Abstraction. Each item is paired with a targeted Ground Knowledge and Meta-Level Reasoning Verification (GKMRV) probe, allowing us to dissociate failures of factual access from failures of inference. We evaluate six contemporary LLMs under both direct and chain-of-thought prompting. Models achieve near-ceiling GKMRV accuracy (mean accuracy 0.918) yet perform poorly on the main reasoning tasks (mean accuracy 0.25). Despite low accuracy, output inferences are highly consistent across samples (mean 0.87), indicating a…
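The three quantities reported in the abstract (probe accuracy, main-task accuracy, and cross-sample consistency) can be illustrated with a small sketch. This is not the authors' evaluation code. The function names, the item fields (`gold`, `samples`, `probe_gold`, `probe_pred`), the toy data, and the definition of consistency as the share of samples agreeing with the most common label are all assumptions made for illustration; the paper may define consistency differently.

```python
from collections import Counter


def accuracy(predictions, gold):
    """Fraction of predictions that match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def consistency(samples):
    """Share of sampled outputs that agree with the most common label.

    Assumed definition (majority agreement), not taken from the paper.
    """
    counts = Counter(samples)
    return counts.most_common(1)[0][1] / len(samples)


def summarize(items):
    """Compare probe accuracy with main-task accuracy on the same items.

    Each item is a dict with hypothetical keys:
      gold        gold label for the main inference task
      samples     labels from repeated sampling on the main task
      probe_gold  gold answer for the paired knowledge probe
      probe_pred  model answer for the paired knowledge probe
    """
    # Use the most common sampled label as the model's main-task answer.
    majority = [Counter(it["samples"]).most_common(1)[0][0] for it in items]
    gold = [it["gold"] for it in items]
    probe_ok = [it["probe_pred"] == it["probe_gold"] for it in items]
    task_ok = [m == g for m, g in zip(majority, gold)]
    return {
        "task_accuracy": accuracy(majority, gold),
        "probe_accuracy": sum(probe_ok) / len(items),
        "mean_consistency": sum(consistency(it["samples"]) for it in items) / len(items),
        # Items where the probe is answered correctly but the inference is wrong.
        "knows_but_fails": sum(p and not t for p, t in zip(probe_ok, task_ok)),
    }


# Toy data, invented for illustration only.
toy_items = [
    {"gold": "entailment", "samples": ["contradiction"] * 4,
     "probe_gold": True, "probe_pred": True},
    {"gold": "contradiction",
     "samples": ["entailment", "entailment", "entailment", "contradiction"],
     "probe_gold": True, "probe_pred": True},
]
summary = summarize(toy_items)
print(summary)
```

On the toy data, both probes are answered correctly while both main-task answers are wrong and stable across samples, which is the pattern the abstract describes: high probe accuracy, low task accuracy, and high consistency.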
