Multi-Session Client-Centered Treatment Outcome Evaluation in Psychotherapy
Hongbin Na, Tao Shen, Shumao Yu, Ling Chen

TL;DR
This paper introduces IPAEval, a novel framework for evaluating psychotherapy outcomes from the client's perspective across multiple sessions, leveraging clinical interviews and a two-stage prompt scheme to improve assessment accuracy.
Contribution
It presents IPAEval, a new multi-session, client-centered evaluation method that incorporates cross-session and session-focused assessments using a structured, interpretable approach.
Findings
Outperforms baseline models in tracking symptom severity.
Effectively evaluates treatment progress over multiple sessions.
Validates the benefits of items-aware reasoning mechanisms.
Abstract
In psychotherapy, therapeutic outcome assessment, or treatment outcome evaluation, is essential to mental health care by systematically evaluating therapeutic processes and outcomes. Existing large language model approaches often focus on therapist-centered, single-session evaluations, neglecting the client's subjective experience and longitudinal progress across multiple sessions. To address these limitations, we propose IPAEval, a client-Informed Psychological Assessment-based Evaluation framework, which automates treatment outcome evaluations from the client's perspective using clinical interviews. It integrates cross-session client-contextual assessment and session-focused client-dynamics assessment for a comprehensive understanding of therapeutic progress. Specifically, IPAEval employs a two-stage prompt scheme that maps client information onto psychometric test items, enabling…
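The two-stage scheme and the cross-session comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `ItemScore` record, the 0-4 severity scale, the item names, and the PSDI-style aggregate (mean severity over items rated above zero, as in the SCL-90-R) are assumptions made for illustration. The LLM calls that would produce the evidence and the ratings are replaced by hand-written values.

```python
from dataclasses import dataclass


@dataclass
class ItemScore:
    item_id: str   # hypothetical identifier of a psychometric test item
    evidence: str  # stage 1: client statement mapped onto this item
    severity: int  # stage 2: rating on an assumed 0-4 scale


def psdi(scores):
    """Mean severity over items rated above zero (0.0 if there are none)."""
    positive = [s.severity for s in scores if s.severity > 0]
    return sum(positive) / len(positive) if positive else 0.0


def outcome_change(earlier_session, later_session):
    """Difference in the aggregate; negative means distress decreased."""
    return psdi(later_session) - psdi(earlier_session)


# Hand-written stand-ins for what the two prompt stages might return.
session_1 = [
    ItemScore("sleep", "I barely sleep at night", 3),
    ItemScore("worry", "I worry about everything", 4),
    ItemScore("appetite", "", 0),
]
session_5 = [
    ItemScore("sleep", "I wake up once or twice", 1),
    ItemScore("worry", "The worry comes and goes", 2),
    ItemScore("appetite", "", 0),
]

print(psdi(session_1), psdi(session_5), outcome_change(session_1, session_5))
```

Keeping the evidence string next to each rating is one way to preserve the interpretability the abstract emphasizes: every score can be traced back to the client statement that motivated it.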
Peer Reviews
Decision: Submitted to ICLR 2025
Originality: The paper offers a novel approach to psychotherapy outcome evaluation by shifting from a therapist-centered, single-session paradigm to a client-centered, multi-session framework. This approach, embodied in the IPAEval framework, introduces a unique perspective within mental health assessments by prioritizing the client’s evolving experience across sessions.
Quality: The paper's experimental design is comprehensive, testing multiple LLMs with various statistical metrics. The author
1. Despite having 110 and 800 client sessions available, the authors only annotated 30 for psychological assessment and 60 for treatment outcomes, using this small subset to establish a “Gold Model” for generating reference scores. This approach may introduce bias, as basing the Gold Model on such a small sample could affect the reliability of the labeled data, especially in representing the full diversity of client interactions.
2. The paper presents IPAEval as an improvement over models like
1. Proposes a generalizable pipeline that transforms client information/dialogue into metrics (treatment outcomes).
2. Conducts an ablation study on the pipeline, specifically removing the item-aware reasoning (items classified by psychometric test, explanation) to see whether it affects performance in tracking both psychological assessments and treatment outcomes.
3. Performs human annotations for three tasks (psychological assessments: symptom detection and severity assessment; treatment outc
1. The main claim of the paper is to evaluate treatment outcomes across sessions. However, there is no validation or exploration of the TheraPhase dataset to show that it reflects real-world outcomes (i.e., to ensure that the distribution of treatment outcomes over time is close to that of real-world clients).
2. The TheraPhase dataset is based on a Chinese counseling dialogue dataset. The performance difference between models (e.g., Llama vs. Qwen) may be affected by the language's comprehension
1. The paper focuses on evaluating therapeutic outcomes via LLMs, aiming to solve a meaningful problem.
2. It integrates some psychology tests and metrics into an LLM evaluation framework, which is very interesting.
1. The TheraPhase dataset is constructed via GPT-4. There is no human evaluation to guarantee the reliability of the dataset.
2. Experiments are conducted on only one or two datasets. It would be better to evaluate on more datasets.
3. I think the proposed approach is not novel. Most parts of the approach already exist, such as the psychometric test, client profile, and PSDI.
4. In this paper, GPT-4o is selected as the Gold Model. However, as shown in Table 4, its performance is also poor. So
Taxonomy
Topics: Psychotherapy Techniques and Applications
Methods: Focus
