MuRAR: A Simple and Effective Multimodal Retrieval and Answer Refinement Framework for Multimodal Question Answering
Zhengyuan Zhu, Daniel Lee, Hong Zhang, Sai Sree Harsha, Loic Feujio, Akash Maharaj, Yunyao Li

TL;DR
MuRAR is a framework that improves multimodal question answering by retrieving relevant data and refining answers to produce coherent, useful, and readable multimodal responses, especially for enterprise and educational applications.
Contribution
Introduces MuRAR, a simple framework that turns text-based answers in QA systems into multimodal answers through retrieval and refinement, addressing the limitations of previous text-focused approaches.
Findings
Multimodal answers are more useful and readable than plain text answers.
MuRAR effectively integrates multimodal data for comprehensive responses.
Human evaluations favor MuRAR-generated answers over those produced by traditional methods.
Abstract
Recent advancements in retrieval-augmented generation (RAG) have demonstrated impressive performance in the question-answering (QA) task. However, most previous works predominantly focus on text-based answers. While some studies address multimodal data, they still fall short in generating comprehensive multimodal answers, particularly for explaining concepts or providing step-by-step tutorials on how to accomplish specific goals. This capability is especially valuable for applications such as enterprise chatbots and settings such as customer service and educational systems, where the answers are sourced from multimodal data. In this paper, we introduce a simple and effective framework named MuRAR (Multimodal Retrieval and Answer Refinement). MuRAR enhances text-based answers by retrieving relevant multimodal data and refining the responses to create coherent multimodal answers. This…
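To make the retrieve-then-refine idea more concrete, here is a minimal sketch of how a text answer might be enriched with multimodal items. It is not the authors' implementation: the `MultimodalItem` type, the `caption` field, the `refine` and `retrieve_for_snippet` helpers, and the 0.2 threshold are all hypothetical. Word-overlap (Jaccard) similarity stands in for whatever retriever MuRAR actually uses, and "refinement" here is reduced to placing each retrieved item right after the sentence it matches.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class MultimodalItem:
    kind: str      # e.g. "image", "video", "table" (hypothetical schema)
    caption: str   # text used for matching; assumed to exist for every item
    uri: str


def tokens(text: str) -> set:
    # Lowercased alphanumeric tokens as a crude bag-of-words representation.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set, b: set) -> float:
    # Stand-in similarity; a real system would more likely use embeddings.
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def split_snippets(answer: str) -> list:
    # Naive sentence split on terminal punctuation followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]


def retrieve_for_snippet(snippet: str, corpus: list,
                         threshold: float = 0.2) -> Optional[MultimodalItem]:
    # Return the best-matching item whose similarity clears the threshold.
    scored = [(jaccard(tokens(snippet), tokens(item.caption)), item) for item in corpus]
    scored = [pair for pair in scored if pair[0] >= threshold]
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[0])[1]


def refine(answer: str, corpus: list, threshold: float = 0.2) -> str:
    # Interleave retrieved items after the snippets they support,
    # using each item at most once.
    parts, used = [], set()
    for snippet in split_snippets(answer):
        parts.append(snippet)
        remaining = [c for c in corpus if c.uri not in used]
        item = retrieve_for_snippet(snippet, remaining, threshold)
        if item is not None:
            used.add(item.uri)
            parts.append(f"[{item.kind}: {item.uri}]")
    return "\n".join(parts)


if __name__ == "__main__":
    corpus = [
        MultimodalItem("image", "export settings dialog screenshot", "img/export.png"),
        MultimodalItem("video", "tutorial creating a new project", "vid/new_project.mp4"),
    ]
    print(refine("Open the export settings dialog. Then choose a file format.", corpus))
```

With this toy corpus, the first sentence shares enough words with the screenshot caption to pull in the image, while the second sentence matches nothing above the threshold and stays text-only. Keeping retrieval per snippet, rather than per whole answer, is what lets each image or video land next to the step it illustrates.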
Taxonomy
Topics: Topic Modeling · Speech and dialogue systems · Natural Language Processing Techniques
