A Multimodal Deep Learning Framework for Edema Classification Using HCT and Clinical Data

Aram Ansary Ogholbake; Hannah Choi; Spencer Brandenburg; Alyssa Antuna; Zahraa Al-Sharshahi; Makayla Cox; Haseeb Ahmed; Jacqueline Frank; Nathan Millson; Luke Bauerle; Jessica Lee; David Dornbos III; Qiang Cheng

arXiv:2603.26726 · cs.CV · March 31, 2026

TL;DR

This paper introduces AttentionMixer, a deep learning framework that combines head CT images with clinical data to detect brain edema, with an emphasis on interpretability and robustness.

Contribution

AttentionMixer fuses head CT features and clinical metadata using cross-attention and an MLP-Mixer, and reports higher accuracy than baselines along with improved interpretability in edema classification.

Findings

1. AttentionMixer outperforms existing baselines with 87.32% accuracy.
2. Cross-attention lets the model modulate imaging features dynamically based on clinical context.
3. The model is robust to incomplete clinical metadata.

Abstract

We propose AttentionMixer, a unified deep learning framework for multimodal detection of brain edema that combines structural head CT (HCT) with routine clinical metadata. While HCT provides rich spatial information, clinical variables such as age, laboratory values, and scan timing capture complementary context that might otherwise be ignored or naively concatenated. AttentionMixer is designed to fuse these heterogeneous sources in a principled and efficient manner. HCT volumes are first encoded using a self-supervised Vision Transformer Autoencoder (ViT-AE++), without requiring large labeled datasets. Clinical metadata are mapped into the same feature space and used as keys and values in a cross-attention module, where the HCT-derived feature vector serves as the query. This cross-attention fusion allows the network to dynamically modulate imaging features based on patient-specific context and…
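
The cross-attention step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy vectors, and the `present` mask for missing clinical fields are all assumptions made for illustration, and the learned projections that map the HCT feature and the metadata into a shared space are omitted (keys and values are simply the same toy clinical tokens). It shows only plain single-head scaled dot-product attention, with one imaging query attending over clinical tokens.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, keys, values, present=None):
    """Single-head scaled dot-product attention for one query vector.

    query   : imaging feature vector (hypothetical stand-in for the HCT embedding)
    keys    : one vector per clinical variable, same length as query
    values  : one vector per clinical variable
    present : optional booleans; False marks a missing clinical field
              (masking is an assumption here, not taken from the paper)
    """
    dim = len(query)
    if present is None:
        present = [True] * len(keys)
    active = [i for i, ok in enumerate(present) if ok]

    # Missing fields keep weight 0.0; the softmax runs over available fields only.
    weights = [0.0] * len(keys)
    if active:
        scores = [
            sum(q * k for q, k in zip(query, keys[i])) / math.sqrt(dim)
            for i in active
        ]
        for i, w in zip(active, softmax(scores)):
            weights[i] = w

    # Context vector: attention-weighted sum of the value vectors.
    out_dim = len(values[0])
    context = [
        sum(weights[i] * values[i][j] for i in range(len(values)))
        for j in range(out_dim)
    ]
    return context, weights


# Toy inputs: one imaging query and three clinical tokens (all values made up).
hct_feature = [1.0, 0.0]
clinical_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

context, weights = cross_attention(hct_feature, clinical_tokens, clinical_tokens)
masked_context, masked_weights = cross_attention(
    hct_feature, clinical_tokens, clinical_tokens, present=[True, False, True]
)
```

In this sketch the attention weights show which clinical tokens the imaging query draws on, which is one common way such a module is inspected for interpretability. Masking a missing field gives it zero weight and renormalizes over the remaining fields.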
