It Hears, It Sees too: Multi-Modal LLM for Depression Detection By Integrating Visual Understanding into Audio Language Models

Xiangyu Zhao; Yaling Shen; Yiwen Jiang; Zimu Wang; Jiahe Liu; Maxmartwell H Cheng; Guilherme C Oliveira; Robert Desimone; Dominic Dwyer; Zongyuan Ge

arXiv:2511.19877·cs.MM·December 12, 2025

Open Access · 4 Reviews

TL;DR

This paper introduces a multi-modal large language model that adds visual cues to audio-based depression detection. It aligns audio and visual features at the timestamp level and outperforms single-modality models on the DAIC-WOZ dataset.

Contribution

The study presents a novel multi-modal LLM framework that combines visual understanding with audio models for depression detection, improving temporal modeling and reducing training requirements.

Findings

01. Outperforms single-modality models on the DAIC-WOZ dataset

02. Effective fine-grained alignment of audio-visual features

03. Framework adaptable to additional physiological signals

Abstract

Depression is one of the most prevalent mental health disorders globally. In recent years, multi-modal data, such as speech, video, and transcripts, has been increasingly used to develop AI-assisted depression assessment systems. Large language models have further advanced this field due to their strong language understanding and generalization capabilities. However, conventional LLMs remain text-centric and cannot process the rich non-verbal cues found in audio and visual modalities, which are critical components in mental health evaluation. While multi-modal LLMs offer a promising direction, few are tailored for psychological applications. In this study, we propose a novel multi-modal LLM framework for depression detection. Our approach augments an audio language model with visual understanding and aligns audio-visual features at the timestamp level. This fine-grained alignment…
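As a rough illustration of what timestamp-level alignment can mean in practice, the sketch below picks, for each audio token, the visual frame closest to it in time. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `align_visual_to_audio`, the 30 fps frame rate, and the 10 Hz audio token rate are all hypothetical.

```python
import numpy as np

def align_visual_to_audio(visual, visual_fps, n_audio_tokens, audio_token_hz):
    """Return one visual feature row per audio token (nearest frame in time).

    visual: array of shape (n_frames, d) holding per-frame visual features.
    All rates here are illustrative assumptions, not values from the paper.
    """
    # Centre timestamp (in seconds) of each audio token.
    audio_t = (np.arange(n_audio_tokens) + 0.5) / audio_token_hz
    # Frame i is centred at (i + 0.5) / fps, so invert that to get the nearest index.
    idx = np.round(audio_t * visual_fps - 0.5).astype(int)
    # Guard against audio that runs slightly past the end of the video.
    idx = np.clip(idx, 0, len(visual) - 1)
    return visual[idx]

# One second of video at 30 fps, aligned to 10 audio tokens per second.
frames = np.arange(30, dtype=float)[:, None]   # feature value = frame index
aligned = align_visual_to_audio(frames, visual_fps=30, n_audio_tokens=10, audio_token_hz=10)
print(aligned.shape)
```

After this step both streams share one time axis, so each position in the sequence carries an audio feature and the visual feature observed at the same moment.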

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

+ Good empirical results. The model achieves consistent gains across modalities, outperforming both single-modality and previous multi-modal methods for this specific task.
+ Parameter efficiency. By leveraging LoRA and QLoRA, the authors effectively reduce computational cost while maintaining performance, making the approach feasible for research and clinical use.
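For readers unfamiliar with why LoRA reduces training cost, the following is a minimal NumPy sketch of the general low-rank adapter idea: the pretrained weight stays frozen and only two small matrices are trained. The class name `LoRALinear`, the rank, and the scaling value are illustrative assumptions and do not reflect the configuration used in the paper. QLoRA additionally stores the frozen weight in quantized form, which is not shown here.

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (illustrative only)."""

    def __init__(self, weight, rank=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                        # frozen pretrained weight
        self.A = rng.normal(0.0, 0.01, (rank, d_in))  # trainable, small random init
        self.B = np.zeros((d_out, rank))            # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        # Base output plus the scaled low-rank correction x A^T B^T.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        return self.A.size + self.B.size

W = np.ones((64, 32))
layer = LoRALinear(W, rank=4)
print(layer.trainable_params(), "trainable vs", W.size, "frozen")
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and training only has to learn the small correction.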

Weaknesses

- In Figures 1 & 2, a wave is used to denote the video input, which is really inappropriate.
- Unclear Design. The model fuses audio + visual embeddings by simple element-wise addition. Was any normalization or linear projection used to match feature scales? Did the authors compare addition vs concatenation or MLP fusion, or other fusion methods?
- Another concern is the design choice of merging the tokens from the two modalities before feeding them into the LLM, rather than inputting each mod
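The fusion alternatives this reviewer asks about can be made concrete with a small sketch. The code below contrasts element-wise addition (optionally after per-modality normalization) with concatenation followed by a linear projection. It is an illustration of the options named in the review, not the paper's code: the function names, feature sizes, and projection matrix are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each feature vector to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def fuse_add(audio, visual, normalize=False):
    """Element-wise addition; normalization is one way to match feature scales."""
    if normalize:
        audio, visual = layer_norm(audio), layer_norm(visual)
    return audio + visual

def fuse_concat(audio, visual, proj):
    """Concatenate along the feature axis, then project back to the model width."""
    return np.concatenate([audio, visual], axis=-1) @ proj

rng = np.random.default_rng(0)
audio = rng.normal(size=(5, 8))            # 5 time steps, 8-dim features (assumed)
visual = 100.0 * rng.normal(size=(5, 8))   # deliberately on a much larger scale

# Without normalization the larger-scale modality dominates the sum.
raw = fuse_add(audio, visual)
balanced = fuse_add(audio, visual, normalize=True)

# Addition is the special case of concat + projection with proj = [I; I].
proj = np.vstack([np.eye(8), np.eye(8)])
print(np.allclose(fuse_concat(audio, visual, proj), raw))
```

Seen this way, addition is a fixed, parameter-free instance of the concatenate-and-project family, which is why a comparison against a learned projection is a natural ablation to request.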

Reviewer 02 · Rating 4 · Confidence 3

Strengths

The approach of augmenting an existing audio language model is a practical method for developing a multi-modal system. The timestamp-level alignment of audio and visual information is a logical design for processing behavioral data. The sequential three-stage training process, which involves self-supervised pretraining of the visual encoder, cross-modal alignment, and parameter-efficient fine-tuning, is a structured way to integrate a new modality.

Weaknesses

The model relies on pre-extracted visual features from the dataset, not raw video input. This limits the evaluation of the visual component to the quality of these specific features and does not demonstrate an ability to learn from unprocessed video. The data augmentation technique removes interviewer audio and video, which, while focusing on participant data, discards conversational context that might influence participant behavior. The evaluation is conducted on a single dataset, DAIC-WOZ, whi

Reviewer 03 · Rating 4 · Confidence 5

Strengths

**1.** The overall task and motivation behind the work are clearly defined.
**2.** The visuals are clear and informative.
**3.** The experiment setup and rationales behind each evaluation are clearly stated. I also appreciate the sub-subsections under Section 4.3.

Weaknesses

**1.** A major concern regarding this work is its novelty. The multimodal approach to depression detection is not a new concept, as various encoders have been used for feature extraction across different modalities. Additionally, many previous studies have been conducted on the DAIC-WOZ dataset with similar frameworks.
**2.** One suggestion for improvement is to conduct more experiments using other multimodal datasets for depression, particularly those representing diverse languages and demogra

Reviewer 04 · Rating 4 · Confidence 5

Strengths

- The topic is important and highly relevant to both AI and mental health research.
- The integration of audio and visual modalities into an LLM for depression detection sounds reasonable.
- The experimental results show performance improvement on the DAIC-WOZ dataset, demonstrating the method’s potential.

Weaknesses

Limited novelty: The contribution beyond existing MLLM-based depression detection frameworks appears marginal. The model primarily combines known components (audio LLM + visual encoder + timestamp alignment) without introducing fundamentally new mechanisms or theoretical insights.
Single dataset evaluation: The method is only evaluated on the DAIC-WOZ dataset, which limits generalizability and makes it difficult to assess robustness. Other publicly available datasets such as DVlog could strengt

Taxonomy

Topics: Emotion and Mood Recognition · Mental Health via Writing · Digital Mental Health Interventions