Multimedia-Aware Question Answering: A Review of Retrieval and Cross-Modal Reasoning Architectures
Rahul Raja, Arpita Vats

TL;DR
This survey reviews recent multimedia-aware question answering systems that integrate vision, audio, and text modalities, discussing architectures, datasets, challenges, and future research directions for more robust, context-aware QA solutions.
Contribution
It categorizes and analyzes recent approaches in multimedia-aware QA, highlighting key challenges and outlining future research directions in the field.
Findings
Identification of key architectures and retrieval methods
Analysis of benchmark datasets and evaluation protocols
Discussion of challenges like cross-modal alignment and latency-accuracy tradeoffs
Abstract
Question Answering (QA) systems have traditionally relied on structured text data, but the rapid growth of multimedia content (images, audio, video, and structured metadata) has introduced new challenges and opportunities for retrieval-augmented QA. In this survey, we review recent advancements in QA systems that integrate multimedia retrieval pipelines, focusing on architectures that align vision, language, and audio modalities with user queries. We categorize approaches based on retrieval methods, fusion techniques, and answer generation strategies, and analyze benchmark datasets, evaluation protocols, and performance tradeoffs. Furthermore, we highlight key challenges such as cross-modal alignment, latency-accuracy tradeoffs, and semantic grounding, and outline open problems and future research directions for building more robust and context-aware QA systems leveraging multimedia…
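As a rough illustration of what "retrieval" and "fusion" can mean in this setting, the sketch below ranks candidate media items by a weighted sum of per-modality cosine similarities (a simple late-fusion scheme). It is not taken from the survey: the function names, the toy two-dimensional embeddings, the modality weights, and the assumption that the query has already been embedded into each modality's space are all hypothetical choices made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def late_fusion_retrieve(query, items, weights, top_k=1):
    """Rank items by a weighted sum of per-modality similarities.

    query:   dict mapping modality name -> query embedding (assumed precomputed)
    items:   list of dicts with an "id" and an "emb" dict of modality -> embedding
    weights: dict mapping modality name -> fusion weight (hypothetical values)
    """
    scored = []
    for item in items:
        # Only modalities present in both the query and the item contribute.
        score = sum(
            w * cosine(query[m], item["emb"][m])
            for m, w in weights.items()
            if m in query and m in item["emb"]
        )
        scored.append((score, item["id"]))
    scored.sort(reverse=True)  # highest fused score first
    return [item_id for _, item_id in scored[:top_k]]

# Toy data: two candidate clips with made-up text and vision embeddings.
items = [
    {"id": "clip_a", "emb": {"text": [1.0, 0.0], "vision": [0.0, 1.0]}},
    {"id": "clip_b", "emb": {"text": [0.0, 1.0], "vision": [1.0, 0.0]}},
]
query = {"text": [1.0, 0.0], "vision": [0.0, 1.0]}
weights = {"text": 0.5, "vision": 0.5}

top = late_fusion_retrieve(query, items, weights)
```

In a full system the retrieved items would then be passed to an answer generator. Raising the fusion weights or `top_k` trades retrieval cost against recall, which is one concrete form of the latency-accuracy tradeoff the abstract mentions.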
