# Evaluating the Efficacy of AI-Based Interactive Assessments Using Large Language Models for Depression Screening: Development and Usability Study

**Authors:** Zheng Jin, Jiaxing Hu, Dandan Bi, Kaibin Zhao, Huan Yu

PMC · DOI: 10.2196/78401 · JMIR Formative Research · 2026-01-13

## TL;DR

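This study tests a ChatGPT-based version of the Beck Depression Inventory Fast Screen (BDI-FS-GPT) for depression screening. In a sample of 115 participants, it showed higher diagnostic accuracy than the PHQ-9 (AUC=0.953 vs 0.859), and participants reported higher satisfaction with it than with the traditional scale.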

## Contribution

The study introduces an AI-based interactive assessment framework for depression screening that combines natural language processing with traditional psychometric tools.

## Key findings

- The BDI-FS-GPT showed excellent diagnostic accuracy (AUC=0.953); at a cutoff of 3, it detected 89.3% of participants with depression with an 11.5% false-positive rate.
- Participants reported significantly higher satisfaction with the AI-based assessment than with the traditional scale (P=.02).
- The AI tool demonstrated substantial agreement with clinical depression diagnoses (κ=0.72; 88.7% agreement).

## Abstract

The evolution of language models, particularly large language models, has introduced transformative potential for psychological assessment, challenging traditional rating scale methods that have dominated clinical practice for over a century.

This study aimed to develop and validate an automated assessment paradigm that integrates natural language processing with conventional measurement tools to assess depressive symptoms, exploring its feasibility as a novel approach in psychological evaluation.

A cohort of 115 participants, including 28 (24.3%) individuals diagnosed with depression, completed the Beck Depression Inventory Fast Screen via a custom ChatGPT interface (BDI-FS-GPT) and the Chinese version of the Patient Health Questionnaire–9 (PHQ-9). Statistical analyses included the Spearman correlation (PHQ-9 vs BDI-FS-GPT scores), Cohen κ (diagnostic agreement), and area under the curve (AUC) evaluation.

Spearman analysis revealed a moderate correlation between PHQ-9 and BDI-FS-GPT scores. The Cohen κ indicated moderate diagnostic agreement between the PHQ-9 and the BDI-FS-GPT (κ=0.43; 76.5% agreement), substantial agreement between the BDI-FS-GPT and the clinical diagnosis (κ=0.72; 88.7% agreement), and moderate agreement between the PHQ-9 and the clinical diagnosis (κ=0.55; 71.4% agreement). The BDI-FS-GPT demonstrated excellent diagnostic accuracy (AUC=0.953); at a cutoff of 3, it detected 89.3% of participants with depression with an 11.5% false-positive rate. By comparison, the PHQ-9 had an AUC of 0.859 and, at a cutoff of 5, a sensitivity of 71.4% with a 13.8% false-positive rate. Participants also reported significantly higher satisfaction with the automated assessment than with the traditional scale (P=.02).
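The sensitivity, false-positive rate, percent agreement, and Cohen κ reported for the BDI-FS-GPT against the clinical diagnosis all derive from one 2×2 table. The sketch below shows that arithmetic in Python. The cell counts are not given in this summary; they are an assumption, back-calculated from the reported percentages and the 28 of 115 participants with a depression diagnosis (25/28 ≈ 89.3%, 10/87 ≈ 11.5%). The `Confusion` class and helper names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """2x2 table of a screening result against the clinical diagnosis."""
    tp: int  # diagnosed, screened positive
    fn: int  # diagnosed, screened negative
    fp: int  # not diagnosed, screened positive
    tn: int  # not diagnosed, screened negative

    @property
    def n(self) -> int:
        return self.tp + self.fn + self.fp + self.tn


def sensitivity(c: Confusion) -> float:
    return c.tp / (c.tp + c.fn)


def false_positive_rate(c: Confusion) -> float:
    return c.fp / (c.fp + c.tn)


def observed_agreement(c: Confusion) -> float:
    return (c.tp + c.tn) / c.n


def cohen_kappa(c: Confusion) -> float:
    """Cohen kappa: (p_o - p_e) / (1 - p_e) for two binary ratings."""
    p_o = observed_agreement(c)
    p_dx_pos = (c.tp + c.fn) / c.n      # share positive by diagnosis
    p_screen_pos = (c.tp + c.fp) / c.n  # share positive by screening
    # Agreement expected by chance if the two ratings were independent.
    p_e = p_dx_pos * p_screen_pos + (1 - p_dx_pos) * (1 - p_screen_pos)
    return (p_o - p_e) / (1 - p_e)


# ASSUMPTION: counts inferred from the reported percentages, not taken
# from the paper. 28 diagnosed + 87 not diagnosed = 115 participants.
BDI_FS_GPT = Confusion(tp=25, fn=3, fp=10, tn=77)  # cutoff of 3
PHQ_9 = Confusion(tp=20, fn=8, fp=12, tn=75)       # cutoff of 5

if __name__ == "__main__":
    for name, table in (("BDI-FS-GPT", BDI_FS_GPT), ("PHQ-9", PHQ_9)):
        print(
            f"{name}: sensitivity={sensitivity(table):.1%}, "
            f"FPR={false_positive_rate(table):.1%}, "
            f"agreement={observed_agreement(table):.1%}, "
            f"kappa={cohen_kappa(table):.2f}"
        )
```

With these inferred counts, the BDI-FS-GPT row reproduces the reported 89.3% sensitivity, 11.5% false-positive rate, 88.7% agreement, and κ=0.72. The PHQ-9 row reproduces κ=0.55, but its raw agreement works out to about 82.6% rather than the 71.4% stated above. Because the counts are reconstructed rather than sourced, that difference should be checked against the full paper before drawing any conclusion.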

The automated assessment paradigm combines the interactivity and personalization of natural language processing–powered tools with the psychometric rigor of traditional scales, offering preliminary evidence of its feasibility for future psychological assessment. Its ability to enhance engagement while maintaining reliability and validity is encouraging and warrants validation in larger and more diverse studies as large language model technology advances.

International Registered Report Identifier (IRRID): RR2-10.1101/2024.07.19.24310543

## Linked entities

- **Diseases:** depression (MONDO:0002050)

## Full-text entities

- **Diseases:** LLMs (MESH:D007806), posttraumatic stress disorder (MESH:D013313), Mental Disorders (MESH:D001523), Beck Depression (MESH:D057767), DSM-IV (MESH:D006011), AI (MESH:C538142), Crisis (MESH:D001752), Depression (MESH:D003866), suicidal ideation (MESH:D001072)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12848484/full.md

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/PMC12848484/full.md

## References

32 references — full list in the complete paper: https://tomesphere.com/paper/PMC12848484/full.md

---
Source: https://tomesphere.com/paper/PMC12848484