MoqaGPT : Zero-Shot Multi-modal Open-domain Question Answering with Large Language Model

Le Zhang; Yihong Wu; Fengran Mo; Jian-Yun Nie; Aishwarya Agrawal

arXiv:2310.13265 · cs.CL · October 23, 2023 · 1 citation

TL;DR

MoqaGPT is a flexible, zero-shot framework that enables large language models to perform multi-modal open-domain question answering by retrieving and extracting answers from each modality separately and then fusing them, bypassing intricate multi-modality ranking.

Contribution

It introduces a divide-and-conquer approach that lets LLMs handle multiple modalities in a zero-shot setting, improving performance on the MMCoQA and MultiModalQA datasets.

Findings

01 Improves over the supervised baseline on the MMCoQA dataset (+37.91 F1, +34.07 EM)

02 Outperforms zero-shot baselines on MultiModalQA (+9.5 F1, +10.1 EM)

03 Closes the gap with supervised methods in multi-modal QA

Abstract

Multi-modal open-domain question answering typically requires evidence retrieval from databases across diverse modalities, such as images, tables, passages, etc. Even Large Language Models (LLMs) like GPT-4 fall short in this task. To enable LLMs to tackle the task in a zero-shot manner, we introduce MoqaGPT, a straightforward and flexible framework. Using a divide-and-conquer strategy that bypasses intricate multi-modality ranking, our framework can accommodate new modalities and seamlessly transition to new models for the task. Built upon LLMs, MoqaGPT retrieves and extracts answers from each modality separately, then fuses this multi-modal information using LLMs to produce a final answer. Our methodology boosts performance on the MMCoQA dataset, improving F1 by +37.91 points and EM by +34.07 points over the supervised baseline. On the MultiModalQA dataset, MoqaGPT surpasses the…
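The divide-and-conquer flow described in the abstract (retrieve and extract answers per modality, then fuse with an LLM) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Retriever`, `LLM`, `answer_per_modality`, `fuse`, and `moqa_sketch`, the prompt wording, and the `top_k` parameter are all hypothetical, and evidence is assumed to arrive already rendered as text (real image evidence would need a vision-language model).

```python
from typing import Callable, Dict, List

# Hypothetical interfaces; none of these names come from the paper or its repository.
Retriever = Callable[[str, int], List[str]]  # (question, top_k) -> evidence rendered as text
LLM = Callable[[str], str]                   # prompt -> completion


def answer_per_modality(
    question: str,
    retrievers: Dict[str, Retriever],
    llm: LLM,
    top_k: int = 3,
) -> Dict[str, List[str]]:
    """Divide step: query each modality on its own and extract one answer per evidence item."""
    candidates: Dict[str, List[str]] = {}
    for modality, retrieve in retrievers.items():
        answers = []
        for item in retrieve(question, top_k):
            # Assumed prompt format, for illustration only.
            prompt = f"Evidence ({modality}): {item}\nQuestion: {question}\nAnswer:"
            answers.append(llm(prompt).strip())
        candidates[modality] = answers
    return candidates


def fuse(question: str, candidates: Dict[str, List[str]], llm: LLM) -> str:
    """Conquer step: let the LLM reconcile the per-modality candidates into one answer."""
    lines = [
        f"- {modality}: {', '.join(answers)}"
        for modality, answers in candidates.items()
        if answers  # skip modalities that returned no evidence
    ]
    prompt = (
        "Candidate answers by modality:\n"
        + "\n".join(lines)
        + f"\nQuestion: {question}\nFinal answer:"
    )
    return llm(prompt).strip()


def moqa_sketch(
    question: str,
    retrievers: Dict[str, Retriever],
    llm: LLM,
    top_k: int = 3,
) -> str:
    """Run both steps; no ranking across modalities is performed."""
    return fuse(question, answer_per_modality(question, retrievers, llm, top_k), llm)
```

Because each modality is handled by its own entry in `retrievers`, supporting a new modality in this sketch means adding one dictionary entry, and swapping the model means passing a different `llm` callable. That mirrors the flexibility the abstract claims, though the actual system may be organized differently.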

Code & Models

Repositories

lezhang7/moqagpt (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Position-Wise Feed-Forward Layer · Dense Connections · Residual Connection · Absolute Position Encodings · Adam · Byte Pair Encoding