Lightweight Language Models are Prone to Reasoning Errors for Complex Computational Phenotyping Tasks
Sarah Pungitore, Shashank Yadav, David Maughan, and Vignesh Subbian

TL;DR
Lightweight language models frequently make reasoning errors in complex computational phenotyping tasks, and an expanded evaluation framework helps identify and analyze these faults to improve model reliability.
Contribution
This work introduces an extension to the PHEONA framework to evaluate reasoning errors in lightweight LLMs during complex phenotyping tasks, highlighting the prevalence of reasoning faults.
Findings
Reasoning errors are common across all tested models.
Prompt modifications have limited impact on reducing errors.
DeepSeek showed the smallest accuracy decline after prompt changes.
Abstract
Objective: Although computational phenotyping is a central informatics activity whose resulting cohorts support a wide variety of applications, it is time-intensive because it requires manual data review. We previously assessed the ability of LLMs to perform computational phenotyping tasks using computable phenotypes for acute respiratory failure (ARF) respiratory support therapies. The LLMs successfully performed concept classification and classification of single-therapy phenotypes, but underperformed on multiple-therapy phenotypes. To understand issues with these complex tasks, we expanded PHEONA, a generalizable framework for evaluation of LLMs, to include methods specifically for evaluating faulty reasoning. Materials and Methods: We assessed the responses of three lightweight LLMs (DeepSeek-r1 32 billion, Mistral Small 24 billion, and Phi-4 14 billion) both with and without prompt modifications to identify explanation…
