A Systematic Evaluation of Large Language Models for PTSD Severity Estimation: The Role of Contextual Knowledge and Modeling Strategies

Panagiotis Kaliosis; Adithya V Ganesan; Oscar N.E. Kjell; Whitney Ringwald; Scott Feltman; Melissa A. Carr; Dimitris Samaras; Camilo Ruggero; Benjamin J. Luft; Roman Kotov; Andrew H. Schwartz

arXiv:2602.06015·cs.CL·February 6, 2026

Open Access

TL;DR

This study systematically evaluates how different contextual and modeling strategies impact the accuracy of large language models in estimating PTSD severity from narratives, highlighting the importance of detailed context and ensemble methods.

Contribution

It provides a comprehensive analysis of factors influencing LLM performance in mental health assessment, including context, reasoning, and ensemble strategies, across multiple models.

Findings

01. Detailed construct definitions improve accuracy.

02. Increased reasoning effort enhances estimation.

03. Ensembling models yields best performance.

Abstract

Large language models (LLMs) are increasingly being used in a zero-shot fashion to assess mental health conditions, yet we have limited knowledge of which factors affect their accuracy. In this study, we utilize a clinical dataset of natural language narratives and self-reported PTSD severity scores from 1,437 individuals to comprehensively evaluate the performance of 11 state-of-the-art LLMs. To understand the factors affecting accuracy, we systematically varied (i) contextual knowledge like subscale definitions, distribution summary, and interview questions, and (ii) modeling strategies including zero-shot vs few-shot, amount of reasoning effort, model sizes, structured subscales vs direct scalar prediction, output rescaling, and nine ensemble methods. Our findings indicate that (a) LLMs are most accurate when provided with detailed construct definitions and context of the narrative;…
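
The abstract names output rescaling and ensembling among the modeling strategies but, as excerpted here, does not say how either is computed. The following is a minimal sketch of one simple form of each, under stated assumptions: a per-narrative mean across models, followed by a linear rescaling to a target mean and standard deviation. It is not necessarily any of the paper's nine ensemble methods or its rescaling procedure, and the function names, scores, and target values are hypothetical.

```python
from statistics import mean, pstdev


def mean_ensemble(per_model_preds):
    """Average predictions across models, narrative by narrative.

    per_model_preds is a list of equal-length lists, one per model.
    """
    return [mean(scores) for scores in zip(*per_model_preds)]


def rescale(preds, target_mean, target_sd):
    """Linearly map predictions so their mean and SD match the targets.

    Assumption: the targets would come from a known score distribution;
    the values used below are made up for illustration.
    """
    m, s = mean(preds), pstdev(preds)
    if s == 0:
        # All predictions identical: no spread to rescale.
        return [target_mean for _ in preds]
    return [target_mean + (p - m) / s * target_sd for p in preds]


# Hypothetical severity estimates from three models for four narratives.
model_a = [10.0, 20.0, 30.0, 40.0]
model_b = [12.0, 18.0, 33.0, 37.0]
model_c = [8.0, 25.0, 27.0, 44.0]

ensemble = mean_ensemble([model_a, model_b, model_c])
rescaled = rescale(ensemble, target_mean=30.0, target_sd=10.0)
```

Averaging first and rescaling second is only one possible ordering; rescaling each model's output before combining is an equally plausible design.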

Taxonomy

Topics: Mental Health via Writing · Posttraumatic Stress Disorder Research · Machine Learning in Healthcare