Multimedia-Aware Question Answering: A Review of Retrieval and Cross-Modal Reasoning Architectures

Rahul Raja; Arpita Vats

arXiv:2510.20193·cs.IR·October 24, 2025

TL;DR

This survey reviews recent multimedia-aware question answering systems that integrate vision, audio, and text modalities, discussing architectures, datasets, challenges, and future research directions for more robust, context-aware QA solutions.

Contribution

The survey categorizes and analyzes recent approaches to multimedia-aware QA, highlights key challenges, and outlines future research directions for the field.

Findings

01. Identification of key architectures and retrieval methods

02. Analysis of benchmark datasets and evaluation protocols

03. Discussion of challenges like cross-modal alignment and latency-accuracy tradeoffs

Abstract

Question Answering (QA) systems have traditionally relied on structured text data, but the rapid growth of multimedia content (images, audio, video, and structured metadata) has introduced new challenges and opportunities for retrieval-augmented QA. In this survey, we review recent advancements in QA systems that integrate multimedia retrieval pipelines, focusing on architectures that align vision, language, and audio modalities with user queries. We categorize approaches based on retrieval methods, fusion techniques, and answer generation strategies, and analyze benchmark datasets, evaluation protocols, and performance tradeoffs. Furthermore, we highlight key challenges such as cross-modal alignment, latency-accuracy tradeoffs, and semantic grounding, and outline open problems and future research directions for building more robust and context-aware QA systems leveraging multimedia…
