CBT-Audio: Evaluating Audio Language Models for Patient-Side Distress Intensity Estimation in CBT Session Recordings

Qixuan Hu; Shuchang Ye; Xumou Zhang; Anastasia Serafimovska; Anastasia Suraev; Amit Saha; Ping-hsiu Lin; Sydney Su; Usman Naseem; Adam G. Dunn; Jinman Kim

arXiv:2605.17370 · cs.AI · May 20, 2026

TL;DR

This paper introduces CBT-Audio, a new dataset of spoken CBT sessions with distress labels, and evaluates how audio language models can improve patient distress estimation beyond text alone.

Contribution

The paper presents CBT-Audio, a novel dataset for spoken CBT sessions, and demonstrates that combining audio with transcripts enhances distress estimation accuracy.

Findings

01 Audio models outperform text-only models in distress estimation.

02 Adding audio improves model performance in 8 out of 10 cases.

03 Case studies highlight benefits when verbal content and vocal cues diverge.

Abstract

Cognitive behavioural therapy (CBT) is widely used to help patients understand and manage psychological distress. It is often delivered through spoken conversation, where therapists attend not only to what patients say but also to how they say it, because these cues can help therapists decide how to respond and adapt treatment. Progress in building AI systems for CBT remains largely limited to text, partly because most available datasets are text-based and shareable spoken CBT data are scarce under ethical and privacy constraints. This creates a blind spot: text-based models and evaluations cannot capture the mismatch between the transcript and the patient's voice, even though therapists often rely on this mismatch to understand patient distress. We introduce CBT-Audio, a dataset for evaluating patient distress estimation from spoken CBT sessions with audio language models. CBT-Audio…
