Decoding the Multimodal Maze: A Systematic Review on the Adoption of Explainability in Multimodal Attention-based Models

Md Raisul Kibria; Sébastien Lafond; Janan Arslan

arXiv:2508.04427·cs.LG·April 28, 2026

TL;DR

This systematic review examines the adoption of explainability techniques in multimodal attention-based models, highlighting current challenges and providing recommendations for more rigorous evaluation practices.

Contribution

The review analyzes recent multimodal XAI research, identifies gaps, and proposes standardized evaluation and reporting guidelines.

Findings

01

Most studies focus on vision-language and language-only models.

02

Attention-based techniques are most commonly used for explanation.

03

Evaluation methods are often inconsistent and lack robustness.

Abstract

Multimodal learning has witnessed remarkable advancements in recent years, particularly with the integration of attention-based models, leading to significant performance gains across a variety of tasks. Parallel to this progress, the demand for explainable artificial intelligence (XAI) has spurred a growing body of research aimed at interpreting the complex decision-making processes of these models. This systematic literature review analyzes research published between January 2020 and early 2024 that focuses on the explainability of multimodal models. Framed within the broader goals of XAI, we examine the literature across multiple dimensions, including model architecture, modalities involved, explanation algorithms and evaluation methodologies. Our analysis reveals that the majority of studies are concentrated on vision-language and language-only models, with attention-based…
