Multimodal Hypothetical Summary for Retrieval-based Multi-image Question Answering

Peize Li; Qingyi Si; Peng Fu; Zheng Lin; Yan Wang

arXiv:2412.14880·cs.CV·December 20, 2024

TL;DR

This paper introduces MHyS, a multimodal hypothetical summary approach that enhances retrieval-based multi-image question answering by transforming images into text summaries, improving retrieval accuracy and overall QA performance.

Contribution

The paper presents a novel multimodal summarization method that replaces real images with text summaries at retrieval time, turning the task into text-to-text retrieval and narrowing the modality gap in multi-image QA.

Findings

01 Achieved a 3.7% absolute improvement on RETVQA

02 Improved retrieval accuracy by transforming images into text summaries

03 Demonstrated effectiveness through comprehensive experiments and ablations

Abstract

The retrieval-based multi-image question answering (QA) task involves retrieving multiple question-related images and synthesizing them to generate an answer. Conventional "retrieve-then-answer" pipelines often suffer from cascading errors because the QA training objective does not optimize the retrieval stage. To address this issue, we propose a novel method to effectively introduce and reference retrieved information in QA. Given the image set to be retrieved, we employ a multimodal large language model (visual perspective) and a large language model (textual perspective) to obtain a multimodal hypothetical summary (MHyS) in question form and description form. By combining visual and textual perspectives, MHyS captures image content more specifically and replaces real images in retrieval, which eliminates the modality gap by transforming the task into text-to-text retrieval and helps…
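To make the text-to-text retrieval idea concrete, here is a minimal sketch in Python. It is not the authors' implementation. The summaries are hand-written placeholders standing in for the output of the multimodal large language model and the large language model. The bag-of-words cosine similarity is a toy substitute for whatever retriever the paper actually uses. Scoring each image by its best-matching summary is also an assumption made for illustration, and the names `tokenize`, `cosine`, and `retrieve` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, summaries, top_k=2):
    """Rank images by comparing the question with their text summaries.

    Each image is scored by its best-matching summary (an assumption),
    so no image features are touched at retrieval time.
    """
    q_vec = Counter(tokenize(question))
    scored = []
    for image_id, texts in summaries.items():
        score = max(cosine(q_vec, Counter(tokenize(t))) for t in texts)
        scored.append((score, image_id))
    # Highest score first; ties are broken by image id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [image_id for _, image_id in scored[:top_k]]


# Placeholder summaries: one question-form and one description-form text
# per image. In the paper these would be generated by an MLLM and an LLM.
summaries = {
    "img_001": [
        "What color is the umbrella?",
        "A woman holds a red umbrella on a rainy street.",
    ],
    "img_002": [
        "What color is the bus?",
        "A yellow bus is parked near a school.",
    ],
    "img_003": [
        "How many dogs are on the beach?",
        "Two dogs run along a sandy beach.",
    ],
}

question = "Do the umbrella and the bus have the same color?"
print(retrieve(question, summaries, top_k=2))
```

With these placeholder summaries, the umbrella and bus images rank above the beach image, because the question-form summaries share more tokens with the query than any description of the third image does.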

Code & Models

Repositories

peizeli96/MHys
pytorch

Videos

Multimodal Hypothetical Summary for Retrieval-based Multi-image Question Answering · Underline

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications

Methods: Sparse Evolutionary Training · Contrastive Learning · Contrastive Language-Image Pre-training · ALIGN