RAMQA: A Unified Framework for Retrieval-Augmented Multi-Modal Question Answering
Yang Bai, Christan Earl Grant, Daisy Zhe Wang

TL;DR
RAMQA is a unified framework that combines learning-to-rank with generative re-ranking for multi-modal question answering over text and images, improving results on the WebQA and MultiModalQA benchmarks.
Contribution
RAMQA pairs a pointwise multi-modal ranker built on LLaVA with an instruction-tuned LLaMA model that re-ranks the top-k documents and generates answers.
Findings
Significant performance improvements on WebQA and MultiModalQA benchmarks.
Effective integration of ranking and generative models for multi-modal QA.
Demonstrated superiority over strong baseline methods.
Abstract
Multi-modal retrieval-augmented Question Answering (MRAQA), integrating text and images, has gained significant attention in information retrieval (IR) and natural language processing (NLP). Traditional ranking methods rely on small encoder-based language models, which are incompatible with modern decoder-based generative large language models (LLMs) that have advanced various NLP tasks. To bridge this gap, we propose RAMQA, a unified framework combining learning-to-rank methods with generative permutation-enhanced ranking techniques. We first train a pointwise multi-modal ranker using LLaVA as the backbone. Then, we apply instruction tuning to train a LLaMA model for re-ranking the top-k documents using an innovative autoregressive multi-task learning approach. Our generative ranking model generates re-ranked document IDs and specific answers from document candidates in various…
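The two-stage flow described in the abstract (pointwise scoring, then generative re-ranking of the top-k candidates) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: `toy_score` is a token-overlap stand-in for the LLaVA-based pointwise ranker, `first_stage_top_k` and `permuted_prompts` are hypothetical helper names, the prompt wording is invented, and "permutation-enhanced" is read simply as showing the same candidates to the re-ranker in several shuffled orders. It is not the authors' implementation.

```python
import random
from typing import Callable, List, Tuple

# (doc_id, text); an image would be represented by its caption in this toy setup.
Doc = Tuple[str, str]


def toy_score(question: str, doc_text: str) -> float:
    """Stand-in for a pointwise ranker: share of question tokens found in the document."""
    q_tokens = set(question.lower().split())
    d_tokens = set(doc_text.lower().split())
    return len(q_tokens & d_tokens) / max(len(q_tokens), 1)


def first_stage_top_k(
    question: str,
    docs: List[Doc],
    k: int,
    score_fn: Callable[[str, str], float] = toy_score,
) -> List[Doc]:
    """Score each document independently (pointwise) and keep the k best."""
    ranked = sorted(docs, key=lambda d: score_fn(question, d[1]), reverse=True)
    return ranked[:k]


def permuted_prompts(
    question: str, candidates: List[Doc], n_perms: int, seed: int = 0
) -> List[str]:
    """Build re-ranking prompts that list the same candidates in shuffled orders."""
    rng = random.Random(seed)
    prompts = []
    for _ in range(n_perms):
        order = list(candidates)
        rng.shuffle(order)
        listing = "\n".join(f"[{doc_id}] {text}" for doc_id, text in order)
        # Hypothetical instruction format; a generative re-ranker would be asked
        # to emit relevant document IDs followed by an answer.
        prompts.append(
            f"Question: {question}\n{listing}\nRelevant document IDs and answer:"
        )
    return prompts


docs = [
    ("d1", "the eiffel tower is in paris"),
    ("d2", "bananas are yellow"),
    ("d3", "paris is the capital of france"),
]
question = "where is the eiffel tower"
top = first_stage_top_k(question, docs, k=2)
prompts = permuted_prompts(question, top, n_perms=3)
```

In a real system the scoring function would be a trained multi-modal model and the prompts would be passed to an instruction-tuned LLM; the sketch only shows how the two stages hand data to each other.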
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems
Methods: Softmax · Attention Is All You Need · LLaMA
