MLlm-DR: Towards Explainable Depression Recognition with MultiModal Large Language Models

Wei Zhang; Juan Chen; En Zhu; Wenhong Cheng; YunPeng Li; Yanbo J. Wang

arXiv:2507.05591 · cs.AI · March 19, 2026

TL;DR

This paper introduces MLlm-DR, a multimodal large language model for explainable depression diagnosis from interview videos. It predicts depression scores together with evaluation rationales and achieves state-of-the-art results on the CMDC and E-DAIC-WOZ datasets.

Contribution

The paper presents a novel multimodal LLM architecture with a lightweight query module and a specialized training dataset for improved, explainable depression diagnosis.

Findings

1. Achieves state-of-the-art results on the CMDC and E-DAIC-WOZ datasets.

2. Effectively integrates speech and visual data for depression analysis.

3. Provides interpretable evaluation rationales for diagnoses.

Abstract

Automated depression diagnosis aims to analyze multimodal information from interview videos to predict participants' depression scores. Previous studies often lack clear explanations of how these scores were determined, limiting their adoption in clinical practice. While the advent of LLMs provides a possible pathway for explainable depression diagnosis, current LLMs capable of processing multimodal data lack training on interview data, resulting in poor diagnostic performance when used directly. In this paper, we propose a novel multimodal large language model (MLlm-DR) that can understand multimodal inputs and supports explainable depression diagnosis. MLlm-DR integrates a smaller LLM and a lightweight query module (LQ-former). Specifically, the smaller LLM is designed to generate depression scores and corresponding evaluation rationales. To enhance its logical reasoning…
