# Alignment of Large Language Model Responses With Human Therapists in Motivational Interviewing

**Authors:** Bazen Gashaw Teferra, Sandra Huang, Nabil Johny, Argyrios Perivolaris, Huda Al-Shamali, Karisa Parkington, Alice Rueda, Richard J. Zeifman, Divya Sharma, Sri Krishnan, Candice Monson, Venkat Bhat

PMC · DOI: 10.1001/jamanetworkopen.2026.2750 · 2026-03-23

## TL;DR

This study evaluates how well a large language model's responses match those of human therapists during motivational interviewing sessions, finding moderate contextual appropriateness but limited semantic alignment.

## Contribution

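The study introduces a method for assessing how closely LLM-generated responses align with human therapist responses in motivational interviewing, pairing two automated metrics: cosine similarity of sentence embeddings (semantic overlap) and DeepEval (coherence and contextual appropriateness).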

## Key findings

- The LLM (GPT-4o) scored higher on contextual appropriateness (DeepEval, mean 0.72) than on semantic similarity to therapist responses (cosine similarity, mean 0.29).
- Alignment was higher in sessions with greater therapist topic consistency.
- LLM performance declined slightly over longer conversations with signs of reduced contextual grounding.

## Abstract

This cross-sectional study uses automated similarity metrics to examine how closely the responses of a large language model align with human therapist responses in motivational interviewing conversations.

Can a large language model (LLM) generate therapist responses that align with human therapist turns in motivational interviewing (MI)–oriented conversations?

In this cross-sectional study of 154 high-fidelity MI sessions (3706 therapist turns), the LLM showed low semantic similarity to therapist responses but higher contextual appropriateness. Alignment was significantly higher in sessions with greater therapist topic consistency and declined modestly over longer conversations.

The findings suggest LLMs can produce contextually appropriate MI-consistent responses, but limitations in coherence and stylistic alignment highlight the need for further validation before clinical use.

Large language models (LLMs) are increasingly applied to mental health contexts, yet their capacity to generate responses that align with evidence-based psychotherapy remains uncertain. Motivational interviewing (MI), a structured counseling approach, provides an empirically grounded setting for evaluating alignment between LLM-generated and human therapist responses.

This study aimed to evaluate how closely an LLM’s responses align with therapist responses in MI sessions, using automated similarity metrics.

This cross-sectional study used high-fidelity therapist-client transcripts annotated with the Motivational Interviewing Treatment Integrity system. Transcripts were sourced from publicly available counseling videos. For each therapist turn, the GPT-4o LLM generated a response using a standardized, MI-informed prompt based on the preceding conversation context. Analyses were conducted between March and May 2025.
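The per-turn generation setup described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `(speaker, text)` transcript layout, the function names, and the stub `generate` callable are all assumptions made for the example, and the study's actual MI-informed prompt is not reproduced here. Whether a therapist turn with no preceding context (such as an opening turn) was included in the study is also not stated; the sketch keeps it.

```python
from typing import Callable, List, Tuple

# Hypothetical transcript layout: (speaker, text), speaker is "therapist" or "client".
Turn = Tuple[str, str]


def build_generation_tasks(transcript: List[Turn]) -> List[Tuple[List[Turn], str]]:
    """Pair each therapist turn with the conversation that preceded it."""
    tasks = []
    for i, (speaker, text) in enumerate(transcript):
        if speaker == "therapist":
            # Context is every turn before this one; the turn itself is the reference.
            tasks.append((transcript[:i], text))
    return tasks


def generate_all(
    transcript: List[Turn],
    generate: Callable[[List[Turn]], str],
) -> List[Tuple[str, str]]:
    """Return (therapist_reference, model_candidate) pairs.

    `generate` stands in for the LLM call (assumed interface: context in, text out).
    """
    return [(ref, generate(ctx)) for ctx, ref in build_generation_tasks(transcript)]


transcript = [
    ("therapist", "What brings you in today?"),
    ("client", "I want to cut back on drinking."),
    ("therapist", "You are ready for a change."),
]

# Stub generator so the sketch runs without any API access.
pairs = generate_all(transcript, lambda ctx: f"[response to {len(ctx)} prior turns]")
```

Each resulting pair holds the therapist's actual turn and the model's candidate for the same position, which is the unit the similarity metrics are then applied to.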

Alignment between LLM-generated and therapist responses was assessed using (1) cosine similarity based on sentence embeddings to capture semantic overlap and (2) DeepEval, a contextual deep-learning–based metric assessing coherence and contextual appropriateness. A therapist topic-consistency index quantified within-session thematic coherence and was examined as a moderator of alignment.
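For reference, cosine similarity between two embedding vectors can be computed as in the sketch below. The three-dimensional vectors are toy stand-ins; the sentence-embedding model used in the study is not named in this summary, so the example assumes the embeddings have already been computed elsewhere.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


# Toy embeddings (assumed values, for illustration only).
therapist_vec = [1.0, 0.0, 1.0]
llm_vec = [1.0, 1.0, 0.0]
score = cosine_similarity(therapist_vec, llm_vec)
```

A score near 1 means the two responses point in nearly the same direction in embedding space; a score near 0 means they share little semantic content under that embedding.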

A total of 3706 therapist turns from 154 MI sessions were evaluated. Mean (SD) DeepEval scores were higher than mean (SD) cosine similarity scores (0.72 [0.31] vs 0.29 [0.20]; P < .001), suggesting limited semantic overlap despite greater contextual appropriateness. Therapist topic consistency significantly moderated similarity, where cosine similarity was higher in high-consistency than low-consistency sessions (mean [SD] difference, 0.027 [0.007]; t(3706) = 3.987; P < .001), as was DeepEval score (mean [SD] difference, 0.038 [0.010]; t(3706) = 3.747; P < .001). Correlation between metrics was negligible (Spearman ρ, −0.01), indicating that they captured distinct aspects of response alignment. LLM performance declined slightly across longer conversations (mean [SD] slope reduction for cosine similarity, −0.0005 [0.0016], and for DeepEval, −0.0005 [0.0022]), with increased verbosity and signs of reduced contextual grounding.
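Two of the statistics reported above, the per-session slope across turns and the Spearman correlation between metrics, can be sketched in plain Python. This is an illustrative reconstruction under assumptions: the summary does not say how slopes were fitted, so ordinary least squares against the turn index is assumed, and the scores below are made-up toy values rather than study data.

```python
from typing import List, Sequence


def ols_slope(scores: Sequence[float]) -> float:
    """Least-squares slope of scores against turn index 0..n-1 (assumed fit)."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two turns to fit a slope")
    x_mean = (n - 1) / 2
    y_mean = sum(scores) / n
    sxy = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(scores))
    sxx = sum((i - x_mean) ** 2 for i in range(n))
    return sxy / sxx


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation; raises if either input is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0.0 or vb == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (va * vb) ** 0.5


def spearman_rho(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(a), average_ranks(b))


# Toy per-turn scores for one session (assumed values, not study data).
cosine_scores = [0.30, 0.28, 0.26, 0.24]
session_slope = ols_slope(cosine_scores)
```

Under this reading, a slope is fitted per session and the reported mean (SD) summarizes those slopes across sessions; a negative value corresponds to alignment drifting downward as the conversation lengthens.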

In this cross-sectional study of 154 MI sessions, prompted LLMs showed general alignment with therapist responses in MI-oriented conversations, as judged by automated similarity metrics. However, limitations in long-range coherence, stylistic alignment, and the use of indirect proxies for therapeutic quality highlight the need for improved prompt design, MI-specific evaluation methods, and clinical validation before integration into mental health care.

## Full-text entities

- **Diseases:** chronic disease (MESH:D002908), LLM (MESH:D007806), anxiety (MESH:D001007), depression (MESH:D003866), Disparities (MESH:D011019), MI (MESH:D003072), substance use (MESH:D019966)
- **Chemicals:** ELIZA (-)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC13010193/full.md

---
Source: https://tomesphere.com/paper/PMC13010193