RAMQA: A Unified Framework for Retrieval-Augmented Multi-Modal Question Answering

Yang Bai; Christan Earl Grant; Daisy Zhe Wang

arXiv:2501.13297 · cs.CL · January 24, 2025

TL;DR

RAMQA is a unified framework that combines learning-to-rank with generative ranking to improve multi-modal question answering over text and images, with significant gains reported on the WebQA and MultiModalQA benchmarks.

Contribution

The paper combines a pointwise multi-modal ranker built on LLaVA with an instruction-tuned LLaMA model that re-ranks the top-k documents and generates answers.

Findings

1. Significant performance improvements on the WebQA and MultiModalQA benchmarks.
2. Effective integration of ranking and generative models for multi-modal QA.
3. Demonstrated superiority over strong baseline methods.

Abstract

Multi-modal retrieval-augmented Question Answering (MRAQA), integrating text and images, has gained significant attention in information retrieval (IR) and natural language processing (NLP). Traditional ranking methods rely on small encoder-based language models, which are incompatible with modern decoder-based generative large language models (LLMs) that have advanced various NLP tasks. To bridge this gap, we propose RAMQA, a unified framework combining learning-to-rank methods with generative permutation-enhanced ranking techniques. We first train a pointwise multi-modal ranker using LLaVA as the backbone. Then, we apply instruction tuning to train a LLaMA model for re-ranking the top-k documents using an innovative autoregressive multi-task learning approach. Our generative ranking model generates re-ranked document IDs and specific answers from document candidates in various…
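
The two-stage flow described in the abstract (pointwise ranking, then re-ranking the top-k candidates under different input orders) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `overlap_score` is a toy word-overlap scorer standing in for the LLaVA-based ranker, `toy_rerank` stands in for the instruction-tuned LLaMA re-ranker, and combining the permuted runs by vote counting in `aggregate_votes` is an assumption, since the abstract shown here is truncated before it says how the outputs are combined. All function names are hypothetical.

```python
import random
from collections import Counter
from typing import Callable, Dict, List, Sequence

Scorer = Callable[[str, str], float]


def overlap_score(query: str, doc: str) -> float:
    """Toy relevance score (assumption): distinct query words found in the document."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def pointwise_top_k(query: str, docs: Dict[str, str], score: Scorer, k: int) -> List[str]:
    """Stage 1: score each (query, document) pair independently, keep the k best IDs."""
    ranked = sorted(docs, key=lambda doc_id: score(query, docs[doc_id]), reverse=True)
    return ranked[:k]


def permuted_orders(doc_ids: Sequence[str], n: int, seed: int = 0) -> List[List[str]]:
    """Build n shuffled presentation orders of the same candidate list."""
    rng = random.Random(seed)
    orders = []
    for _ in range(n):
        order = list(doc_ids)
        rng.shuffle(order)
        orders.append(order)
    return orders


def toy_rerank(query: str, docs: Dict[str, str], order: Sequence[str], score: Scorer) -> List[str]:
    """Stage 2 stand-in (assumption): return the candidate IDs, best first.

    A real generative re-ranker would receive the candidates in `order`
    as a prompt and emit the re-ranked IDs as text.
    """
    return sorted(order, key=lambda doc_id: score(query, docs[doc_id]), reverse=True)


def aggregate_votes(rankings: Sequence[Sequence[str]], keep: int) -> List[str]:
    """Assumed aggregation: count how often each ID lands in the top `keep` of a run."""
    votes: Counter = Counter()
    for ranking in rankings:
        votes.update(ranking[:keep])
    return [doc_id for doc_id, _ in votes.most_common(keep)]
```

To try it, pass a small dictionary of document IDs to texts through `pointwise_top_k`, call `toy_rerank` once per order returned by `permuted_orders`, and feed the resulting rankings to `aggregate_votes`. Presenting the same candidates in several orders is one way to reduce sensitivity to input position; whether RAMQA aggregates the runs this way is not stated in the text above.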

Code & Models

Repositories

tonyby/ramqa (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems

Methods: Softmax · Attention Is All You Need · LLaMA