Addressing Missing and Noisy Modalities in One Solution: Unified Modality-Quality Framework for Low-quality Multimodal Data
Sijie Mai, Shiqin Han, Haifeng Hu

TL;DR
This paper introduces a unified framework that simultaneously addresses missing and noisy modalities in low-quality multimodal data, significantly improving robustness and performance in affective computing tasks.
Contribution
The paper proposes the UMQ framework that jointly handles missing and noisy modalities, incorporating a quality estimator, quality enhancer, and a quality-aware mixture-of-experts module.
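The summary does not say how the quality signal enters the mixture-of-experts routing. As a minimal sketch of one common pattern, and not the authors' implementation, the gate below takes the fused features concatenated with per-modality quality scores, so expert weights can shift when a modality is estimated to be degraded. All names, shapes, and the linear experts are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def quality_aware_moe(features, quality, expert_weights, gate_weights):
    """Route features through linear experts with a quality-conditioned gate.

    features:       (batch, dim) fused representation
    quality:        (batch, n_modalities) estimated quality scores (assumed given)
    expert_weights: (n_experts, dim, out_dim), one linear map per expert
    gate_weights:   (dim + n_modalities, n_experts)
    """
    # Assumption: the gate sees both the features and the quality scores.
    gate_in = np.concatenate([features, quality], axis=-1)
    gate = softmax(gate_in @ gate_weights)                      # (batch, n_experts)
    expert_out = np.einsum("bd,edo->beo", features, expert_weights)
    mixed = np.einsum("be,beo->bo", gate, expert_out)           # (batch, out_dim)
    return mixed, gate

rng = np.random.default_rng(0)
batch, dim, n_mod, n_exp, out_dim = 4, 8, 3, 2, 5
output, gate = quality_aware_moe(
    rng.normal(size=(batch, dim)),
    rng.uniform(size=(batch, n_mod)),
    rng.normal(size=(n_exp, dim, out_dim)),
    rng.normal(size=(dim + n_mod, n_exp)),
)
print(output.shape, gate.sum(axis=-1))
```

In a trained model the gate and expert parameters would be learned jointly; here they are random and serve only to show the data flow.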
Findings
UMQ outperforms state-of-the-art baselines on multiple datasets.
The quality estimator effectively ranks modality quality without absolute labels.
The framework improves robustness in complete, missing, and noisy modality scenarios.
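The idea of ranking quality without absolute labels can be illustrated with a pairwise margin loss: the estimator is penalized only when a representation known to be more corrupted scores at least as high as a less corrupted one. The sketch below is a toy illustration under stated assumptions, not the paper's training code. The Gaussian corruption follows the reviewers' description, while zeroing features to simulate a missing modality, the margin value, and `toy_quality_score` (a cosine-similarity stand-in for a learned estimator) are all hypothetical.

```python
import numpy as np

def corrupt(features, noise_std, rng):
    """Add zero-mean Gaussian noise; a larger noise_std means lower quality."""
    return features + rng.normal(0.0, noise_std, size=features.shape)

def pairwise_rank_loss(score_better, score_worse, margin=0.1):
    """Hinge loss that is zero once the better input outscores the worse by `margin`."""
    return np.maximum(0.0, margin - (score_better - score_worse)).mean()

def toy_quality_score(features, clean_reference):
    """Hypothetical stand-in for a learned estimator: cosine similarity to clean features."""
    num = (features * clean_reference).sum(axis=-1)
    den = (np.linalg.norm(features, axis=-1)
           * np.linalg.norm(clean_reference, axis=-1) + 1e-8)
    return num / den

rng = np.random.default_rng(0)
clean = rng.normal(size=(32, 64))

# Three degradation levels with a known relative order: mild > heavy > missing.
mild = corrupt(clean, 0.5, rng)
heavy = corrupt(clean, 2.0, rng)
missing = np.zeros_like(clean)  # assumption: a missing modality is simulated by zeros

s_mild = toy_quality_score(mild, clean)
s_heavy = toy_quality_score(heavy, clean)
s_missing = toy_quality_score(missing, clean)

# Only the relative order is supervised; no absolute quality label is needed.
loss = pairwise_rank_loss(s_mild, s_heavy) + pairwise_rank_loss(s_heavy, s_missing)
print("ranking loss:", loss)
```

Because the loss depends only on score differences, mislabeling the absolute quality of any single sample does not add training noise as long as the pairwise order is correct.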
Abstract
Multimodal data encountered in real-world scenarios are typically of low quality, with noisy modalities and missing modalities being typical forms that severely hinder model performance and robustness. However, prior works often handle noisy and missing modalities separately. In contrast, we jointly address missing and noisy modalities to enhance model robustness in low-quality data scenarios. We regard both noisy and missing modalities as a unified low-quality modality problem, and propose a unified modality-quality (UMQ) framework to enhance low-quality representations for multimodal affective computing. Firstly, we train a quality estimator with explicit supervised signals via a rank-guided training strategy that compares the relative quality of different representations by adding a ranking constraint, avoiding training noise caused by inaccurate absolute quality labels. Then, a…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. Comprehensive experiments: The authors evaluate the approach under multiple conditions (complete, missing, and noisy modalities) and across several datasets.
2. Ablation studies and visualizations: They carefully analyze the effect of each component (estimator, enhancer, MQ-MoE) and visualize the improvement qualitatively.
Major:
1. The manuscript is poorly written, with awkward phrasing and overly long sentences (e.g., in the abstract) that obscure the main ideas. Clarity is a major problem throughout.
2. The paper lacks a formal definition or analysis showing why missing modalities can be treated as a subclass of noisy modalities from an information-theoretic or probabilistic perspective.
3. Besides manually added noise, can the proposed method also handle naturally occurring ones, such as those caused by po
1. UMQ outperforms the compared methods on the benchmarks.
2. The concept of developing a unified approach to handle both noisy and missing modalities is technically sound and addresses an important problem
1. Overall, although the proposed framework reports state-of-the-art performance, it appears rather complex, with multiple tightly coupled components and hyperparameters. Beyond the reported results, it remains unclear whether this work meaningfully advances the field or stimulates further discussion within the research community.
2. The design of the UMQ framework is not theoretically grounded. Its design lacks an in-depth mathematical justification or toy-example simulation on how and why eac
- **Originality / Idea quality**. Casting both missing and noisy modalities as "low-quality" and learning an ordinal quality estimator with explicit anchors plus rank-guided training is thoughtful; it avoids brittle absolute labels and fits naturally with routing.
- **Architecture design**. The decoupling into sample-specific vs. modality-specific subspaces and the quality enhancer that borrows cross-modal sample-specific information are intuitive and empirically helpful via illustrated experime
- **Concerns about high-quality anchors**. The "highest-quality" anchor uses a low unimodal predictive loss threshold. This might not be a robust indicator: the quality score may end up reflecting how easy the task label is for that modality, or rewarding label leakage or majority cues, rather than modality fidelity.
- **Noise realism**. The paper primarily uses Gaussian feature corruption and treats missing as "extreme noise". In fact, degradation could come from misalignment (e.g. between au
Taxonomy
Topics: Emotion and Mood Recognition · Mobile Crowdsensing and Crowdsourcing · Multimodal Machine Learning Applications
