Fine-Tuning Large Audio-Language Models with LoRA for Precise Temporal Localization of Prolonged Exposure Therapy Elements

Suhas BN; Andrew M. Sherrill; Jyoti Alaparthi; Dominik Mattioli; Rosa I. Arriaga; Chris W. Wiese; and Saeed Abdullah

arXiv:2506.09707·eess.AS·December 22, 2025


Open Access

TL;DR

This paper introduces a scalable, privacy-preserving method for automatically localizing key therapy elements in Prolonged Exposure (PE) sessions by fine-tuning a large audio-language model, supporting fidelity assessment and quality control.

Contribution

It presents an approach for automatically identifying therapy phase boundaries in PE sessions: a large audio-language model (Qwen2-Audio) is fine-tuned with Low-Rank Adaptation (LoRA) to predict boundary offsets, using soft supervision guided by task-specific prompts.
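
To make the LoRA idea concrete, here is a minimal sketch of the general technique: a frozen weight matrix W is adapted by adding a scaled low-rank product, W + (alpha / r) · B · A, and only A and B are trained. The matrix sizes, the values, and the helper names `matmul` and `lora_weight` are illustrative assumptions; the summary does not state the rank, scaling, or target layers used in the paper.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_weight(W, A, B, alpha):
    """Return the adapted weight W + (alpha / r) * (B @ A).

    W: frozen pretrained weight, shape (d_out, d_in)
    A: trainable down-projection, shape (r, d_in)
    B: trainable up-projection, shape (d_out, r)
    alpha: scaling hyperparameter (value here is an assumption)
    """
    r = len(A)                      # rank of the update
    scale = alpha / r
    BA = matmul(B, A)               # low-rank update, rank at most r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]

# Toy sizes for illustration only: d_out = 2, d_in = 3, r = 1.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]
A = [[0.5, -0.5, 1.0]]

# B starts at zero, so the adapted model initially matches the base model.
B_init = [[0.0], [0.0]]
W_at_init = lora_weight(W, A, B_init, alpha=2.0)

# After training, B is non-zero and the weight shifts by a rank-1 update.
B_trained = [[1.0], [2.0]]
W_adapted = lora_weight(W, A, B_trained, alpha=2.0)
```

Only A and B receive gradient updates in this scheme, which is why the number of trainable parameters stays small relative to the full model.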

Findings

01. Achieves a mean absolute error of 5.3 seconds in boundary detection.

02. Demonstrates effectiveness on 308 real PE sessions.

03. Highlights the importance of context window size and model adaptation.
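
For readers unfamiliar with the metric in the first finding, the following sketch shows how a mean absolute error over boundary times is computed in general. The timestamps are made-up values for illustration, not data from the paper, and `mean_absolute_error` is a hypothetical helper name.

```python
def mean_absolute_error(predicted, reference):
    """Average absolute difference between paired boundary times, in seconds."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)

# Hypothetical predicted and rater-verified boundary times (seconds into a session).
predicted_s = [62.0, 305.5, 1510.0]
reference_s = [60.0, 300.0, 1520.0]

# Absolute errors are 2.0, 5.5, and 10.0 seconds, so the mean is 17.5 / 3.
mae = mean_absolute_error(predicted_s, reference_s)
```

A lower value means predicted boundaries sit closer, on average, to the reference boundaries; the metric does not distinguish early from late predictions.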

Abstract

Prolonged Exposure (PE) therapy is an effective treatment for post-traumatic stress disorder (PTSD), but evaluating therapist fidelity remains labor-intensive due to the need for manual review of session recordings. We present a method for the automatic temporal localization of key PE fidelity elements, identifying their start and stop times, directly from session audio and transcripts. Our approach fine-tunes a large pre-trained audio-language model, Qwen2-Audio, using Low-Rank Adaptation (LoRA) to process focused 30-second windows of audio-transcript input. Fidelity labels for three core protocol phases, therapist orientation (P1), imaginal exposure (P2), and post-imaginal processing (P3), are generated via LLM-based prompting and verified by trained raters. The model is trained to predict normalized boundary offsets using soft supervision guided by task-specific prompts. On a dataset…
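
The abstract mentions 30-second windows and normalized boundary offsets. One plausible reading, sketched below, is that a boundary time is expressed as a fraction of the window it falls in, and a predicted fraction is mapped back to session time. The exact normalization used in the paper is not given here, so the formula and the function names `to_normalized_offset` and `to_absolute_time` are assumptions.

```python
WINDOW_SECONDS = 30.0  # window length stated in the abstract

def to_normalized_offset(boundary_s, window_start_s, window_s=WINDOW_SECONDS):
    """Encode an absolute boundary time as a fraction in [0, 1] of its window.

    Assumed normalization: (boundary - window start) / window length.
    """
    if not window_start_s <= boundary_s <= window_start_s + window_s:
        raise ValueError("boundary lies outside the window")
    return (boundary_s - window_start_s) / window_s

def to_absolute_time(offset, window_start_s, window_s=WINDOW_SECONDS):
    """Decode a predicted offset back to seconds, clipping it to the window."""
    clipped = min(max(offset, 0.0), 1.0)
    return window_start_s + clipped * window_s

# Hypothetical example: a phase boundary at 612 s inside a window starting at 600 s.
target = to_normalized_offset(612.0, window_start_s=600.0)   # 12 / 30 of the window
recovered = to_absolute_time(target, window_start_s=600.0)   # back to session time

# An out-of-range prediction is clipped to the window edge.
clipped_time = to_absolute_time(1.3, window_start_s=600.0)
```

Predicting a bounded fraction rather than a raw timestamp keeps the regression target on the same scale regardless of where the window sits in the session.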

Taxonomy

Topics: Emotion and Mood Recognition · Digital Mental Health Interventions · Mental Health via Writing