DIVINE: Coordinating Multimodal Disentangled Representations for Oro-Facial Neurological Disorder Assessment
Mohd Mujtaba Akhtar, Girish, Muskaan Singh

TL;DR
This paper introduces DIVINE, a novel multimodal framework that disentangles shared and modality-specific features from audio-visual data to improve neurological disorder assessment accuracy and interpretability.
Contribution
DIVINE is the first framework to combine cross-modal disentanglement, adaptive fusion, and multitask learning for comprehensive neuro-facial disorder evaluation.
Findings
Achieves state-of-the-art accuracy of 98.26% on the Toronto NeuroFace dataset.
Performs well under single-modality testing conditions.
Outperforms unimodal models and baseline fusion methods.
Abstract
In this study, we present a multimodal framework for predicting neuro-facial disorders by capturing both vocal and facial cues. We hypothesize that explicitly disentangling shared and modality-specific representations within multimodal foundation model embeddings can enhance clinical interpretability and generalization. To validate this hypothesis, we propose DIVINE, a fully disentangled multimodal framework that operates on representations extracted from state-of-the-art (SOTA) audio and video foundation models, incorporating hierarchical variational bottlenecks, sparse gated fusion, and learnable symptom tokens. DIVINE operates in a multitask learning setup to jointly predict diagnostic categories (Healthy Control, ALS, Stroke) and severity levels (Mild, Moderate, Severe). The model is trained using synchronized audio and video inputs and evaluated on the Toronto NeuroFace dataset under…
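The combination of sparse gated fusion and multitask prediction named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the three feature streams (shared, audio-specific, video-specific), the fixed gate scores, the top-k gating rule, the random weights, and every function name below are assumptions made for illustration. The abstract does not specify how DIVINE computes its gates or heads.

```python
import math
import random

# Label sets taken from the abstract.
DIAGNOSES = ["Healthy Control", "ALS", "Stroke"]
SEVERITIES = ["Mild", "Moderate", "Severe"]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_gate(scores, k):
    """Assumed gating rule: keep the k largest scores, zero the rest,
    and normalise the kept scores with a softmax."""
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    kept = softmax([scores[i] for i in keep])
    gates = [0.0] * len(scores)
    for i, g in zip(keep, kept):
        gates[i] = g
    return gates


def fuse(streams, gates):
    """Gate-weighted sum of equal-length feature vectors."""
    dim = len(streams[0])
    return [sum(g * s[d] for g, s in zip(gates, streams)) for d in range(dim)]


def linear(x, weights):
    """Bias-free linear layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def predict(streams, gate_scores, w_diag, w_sev, k=2):
    """Fuse the streams once, then feed two task heads (multitask setup)."""
    gates = sparse_gate(gate_scores, k)
    fused = fuse(streams, gates)
    return gates, softmax(linear(fused, w_diag)), softmax(linear(fused, w_sev))


rng = random.Random(0)
dim = 4
# Hypothetical streams: shared, audio-specific, video-specific features.
streams = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
# Hypothetical gate scores; a trained model would compute these from the input.
gate_scores = [0.5, 2.0, -1.0]
w_diag = [[rng.gauss(0, 1) for _ in range(dim)] for _ in DIAGNOSES]
w_sev = [[rng.gauss(0, 1) for _ in range(dim)] for _ in SEVERITIES]

gates, p_diag, p_sev = predict(streams, gate_scores, w_diag, w_sev)
```

With `k=2`, the lowest-scoring stream receives a gate of exactly zero, so the fused vector depends on only two of the three streams. Both heads read the same fused vector, which is the sense in which the two predictions are made jointly in this sketch.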
Taxonomy
Topics: Facial Nerve Paralysis Treatment and Research · Voice and Speech Disorders · Face recognition and analysis
