DIVINE: Coordinating Multimodal Disentangled Representations for Oro-Facial Neurological Disorder Assessment

Mohd Mujtaba Akhtar; Girish; Muskaan Singh

arXiv:2601.07014·eess.AS·January 13, 2026

TL;DR

This paper introduces DIVINE, a novel multimodal framework that disentangles shared and modality-specific features from audio-visual data to improve neurological disorder assessment accuracy and interpretability.

Contribution

DIVINE is the first framework to combine cross-modal disentanglement, adaptive fusion, and multitask learning for comprehensive neuro-facial disorder evaluation.

Findings

01 Achieves state-of-the-art accuracy of 98.26% on the Toronto NeuroFace dataset.

02 Performs well under single-modality testing conditions.

03 Outperforms unimodal models and baseline fusion methods.

Abstract

In this study, we present a multimodal framework for predicting neuro-facial disorders by capturing both vocal and facial cues. We hypothesize that explicitly disentangling shared and modality-specific representations within multimodal foundation model embeddings can enhance clinical interpretability and generalization. To validate this hypothesis, we propose DIVINE, a fully disentangled multimodal framework that operates on representations extracted from state-of-the-art (SOTA) audio and video foundation models and incorporates hierarchical variational bottlenecks, sparse gated fusion, and learnable symptom tokens. DIVINE operates in a multitask learning setup to jointly predict diagnostic categories (Healthy Control, ALS, Stroke) and severity levels (Mild, Moderate, Severe). The model is trained using synchronized audio and video inputs and evaluated on the Toronto NeuroFace dataset under…
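To make the abstract's pipeline more concrete, the sketch below illustrates one possible reading of "sparse gated fusion" followed by multitask prediction. It is not the authors' implementation: the top-k gating rule, the three-stream split (shared, audio-specific, video-specific), the vector size, and every function and parameter name are assumptions made for illustration, and the variational bottlenecks and symptom tokens are omitted. The weights are random, so the printed labels carry no meaning.

```python
import math
import random

DIAGNOSES = ["Healthy Control", "ALS", "Stroke"]
SEVERITIES = ["Mild", "Moderate", "Severe"]


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def linear(weights, x):
    # weights is a list of rows; returns the matrix-vector product.
    return [dot(row, x) for row in weights]


def sparse_gate(scores, k):
    # Assumed sparsification rule: keep the k highest-scoring streams,
    # renormalise them with a softmax, and set the other gates to zero.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    probs = softmax([scores[i] for i in top])
    gates = [0.0] * len(scores)
    for i, p in zip(top, probs):
        gates[i] = p
    return gates


def predict(shared, audio_specific, video_specific, params, k=2):
    # Hypothetical forward pass: gate the three disentangled streams,
    # fuse them by a weighted sum, then apply two task-specific heads.
    streams = [shared, audio_specific, video_specific]
    scores = [dot(w, s) for w, s in zip(params["gate"], streams)]
    gates = sparse_gate(scores, k)
    fused = [sum(g * s[d] for g, s in zip(gates, streams))
             for d in range(len(shared))]
    diag_probs = softmax(linear(params["diag_head"], fused))
    sev_probs = softmax(linear(params["sev_head"], fused))
    return gates, diag_probs, sev_probs


def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim = 4  # toy embedding size, chosen only for this example
params = {
    "gate": random_matrix(rng, 3, dim),       # one scoring vector per stream
    "diag_head": random_matrix(rng, 3, dim),  # untrained diagnosis head
    "sev_head": random_matrix(rng, 3, dim),   # untrained severity head
}
shared, audio_specific, video_specific = random_matrix(rng, 3, dim)

gates, diag_probs, sev_probs = predict(shared, audio_specific,
                                       video_specific, params)
print("gates:", gates)
print("diagnosis:", DIAGNOSES[diag_probs.index(max(diag_probs))])
print("severity:", SEVERITIES[sev_probs.index(max(sev_probs))])
```

With `k=2`, exactly one of the three streams is switched off for each input, which is one way a gate could make the contribution of each modality inspectable. Sharing the fused vector between the two heads is the simplest form of multitask setup; the paper may couple the tasks differently.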

Taxonomy

Topics: Facial Nerve Paralysis Treatment and Research · Voice and Speech Disorders · Face recognition and analysis