impuTMAE: Multi-modal Transformer with Masked Pre-training for Missing Modalities Imputation in Cancer Survival Prediction

Maria Boyko; Aleksandra Beliaeva; Dmitriy Kornilov; Alexander Bernstein; Maxim Sharaev

arXiv:2508.09195·eess.IV·August 14, 2025

3 Reviews

TL;DR

impuTMAE is a transformer-based model that effectively imputes missing modalities in multimodal medical data, improving glioma survival prediction by leveraging pre-training on incomplete data and integrating diverse data types.

Contribution

It introduces a novel multimodal transformer with masked pre-training that handles missing data and enhances prognostic accuracy in cancer survival prediction.

Findings

01 Achieves state-of-the-art performance on glioma survival datasets.

02 Effectively imputes missing modalities during pre-training.

03 Outperforms prior multimodal approaches.

Abstract

The use of diverse modalities, such as omics, medical images, and clinical data, can not only improve the performance of prognostic models but also deepen the understanding of disease mechanisms and facilitate the development of novel treatment approaches. However, medical data are complex and often incomplete, with entire modalities missing, so handling this missingness effectively is crucial for training multimodal models. We introduce impuTMAE, a novel transformer-based end-to-end approach with an efficient multimodal pre-training strategy. It learns inter- and intra-modal interactions while simultaneously imputing missing modalities by reconstructing masked patches. Our model is pre-trained on heterogeneous, incomplete data and fine-tuned for glioma survival prediction using the TCGA-GBM/LGG and BraTS datasets, integrating five modalities: genetic (DNAm, RNA-seq), imaging (MRI, WSI), and clinical data.…
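The masking idea in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names, the token counts, the mask ratio, and the choice of mean squared error are all assumptions made for illustration. The sketch only shows one plausible way to treat a missing modality as a fully masked block of tokens, so that the same reconstruction objective covers both randomly hidden patches and absent modalities.

```python
import numpy as np

def build_mask(available, tokens_per_modality, mask_ratio, rng):
    """Return a boolean mask over all tokens (True = hidden from the encoder).

    available: dict of modality name -> bool (hypothetical per-patient flags)
    tokens_per_modality: dict of modality name -> number of tokens (assumed sizes)
    """
    parts = []
    for name, n_tokens in tokens_per_modality.items():
        if available[name]:
            # Present modality: hide a random subset of its patches.
            n_masked = int(round(mask_ratio * n_tokens))
            m = np.zeros(n_tokens, dtype=bool)
            m[rng.choice(n_tokens, size=n_masked, replace=False)] = True
        else:
            # Missing modality: every token is masked and must be imputed.
            m = np.ones(n_tokens, dtype=bool)
        parts.append(m)
    return np.concatenate(parts)

def reconstruction_loss(pred, target, mask, has_target):
    """Mean squared error over tokens that are masked AND have ground truth.

    Tokens of a missing modality have no target, so they are imputed
    but not scored. MSE is an assumption, not taken from the paper.
    """
    scored = mask & has_target
    if not scored.any():
        return 0.0
    return float(np.mean((pred[scored] - target[scored]) ** 2))

# Toy patient with three modalities, one of them (MRI) missing.
rng = np.random.default_rng(0)
tokens = {"rna": 4, "mri": 8, "clinical": 2}
available = {"rna": True, "mri": False, "clinical": True}
mask = build_mask(available, tokens, mask_ratio=0.5, rng=rng)

# Ground truth exists only for tokens of available modalities.
has_target = np.concatenate(
    [np.full(n, available[name]) for name, n in tokens.items()]
)
```

In this toy setup the MRI tokens are always masked, while only half of the RNA and clinical tokens are hidden; the loss is computed on the hidden RNA and clinical tokens alone, since no MRI ground truth exists for this patient.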

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

1. Using MAE to address the missing-modality challenge is suitable and straightforward.
2. The experimental results demonstrate the effectiveness of the method.

Weaknesses

1. The evaluation is not convincing because the set of baselines compared in this paper is severely limited. There are plenty of recent survival prediction methods, yet the authors only choose MoME, CMTA, and MultiSurv. The authors should also consider MOTCAT, MCAT, PIBD, SurvPath, MMP, etc. In particular, DisPro (CVPR 2025) addresses the missing-modality challenge in survival prediction, and the authors should compare their method against it.
2. There are no ablation studies on the co

Reviewer 02 · Rating 2 · Confidence 5

Strengths

1. The core idea of extending the masked autoencoder (MAE) paradigm to a heterogeneous multimodal setting (i.e., genomics and imaging) is a non-trivial and valuable contribution.
2. This work addresses an important problem in medical AI: clinical datasets are almost always incomplete.

Weaknesses

1. The paper's state-of-the-art claim is unreliable due to a critical flaw in the experimental setup. As explicitly stated in Section 4, the authors apply their own proposed imputation method to all prior approaches. This invalidates the comparison in Table 3: the experiment does not compare impuTMAE against the original MoME or MultiSurv, but against those models modified and augmented with impuTMAE's imputer, so the reported SOTA results cannot be trusted.
2. In Table 3, the full mul

Reviewer 03 · Rating 2 · Confidence 5

Strengths

+ The paper directly tackles a critical, real-world problem in medical AI: incomplete multimodal datasets.
+ The model is tested on multiple cancer types with varying data availability.

Weaknesses

- The technical innovation in the presented work is limited. MAE-based pre-training followed by fine-tuning for survival analysis is not a new training paradigm.
- While the model demonstrably improves reconstruction and prediction, the paper offers limited analysis of how the cross-modal interactions occur.
- Results for simple baselines (e.g., direct fine-tuning without pre-training) should be included, and SOTA methods for missing modalities should be part of the comparison study.
- It
