FM2DS: Few-Shot Multimodal Multihop Data Synthesis with Knowledge Distillation for Question Answering

Amirhossein Abaskohi; Spandana Gella; Giuseppe Carenini; Issam H. Laradji

arXiv:2412.07030 · cs.CL · September 16, 2025

TL;DR

FM2DS is a novel framework that synthesizes high-quality multimodal multihop question-answering datasets, enabling improved training and evaluation of models on complex, long, multimodal documents.

Contribution

We introduce FM2DS, the first comprehensive pipeline for creating high-quality multimodal multihop QA datasets, including a new benchmark for long documents.

Findings

01 Models trained on our synthesized data outperform those trained on human-collected data by 1.9 exact-match (EM) points.

02 Our dataset and benchmark facilitate better training and evaluation for MMQA tasks.

03 FM2DS enables the creation of high-quality, long multimodal documents for QA research.
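Finding 01 is stated in exact match (EM), so a short sketch of how such a score is typically computed may help in reading the number. The normalization below (lowercasing, stripping punctuation and English articles) follows the common SQuAD-style convention and is an assumption; this page does not state which normalization the authors use, and the function names are hypothetical.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: SQuAD-style normalization, not confirmed by the paper page.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> float:
    """Return 1.0 if the prediction equals any gold answer after normalization."""
    pred = normalize_answer(prediction)
    return float(any(pred == normalize_answer(gold) for gold in gold_answers))


def em_score(predictions: list[str], references: list[list[str]]) -> float:
    """Average exact match over a dataset, reported on a 0-100 scale."""
    if not predictions:
        return 0.0
    total = sum(exact_match(p, refs) for p, refs in zip(predictions, references))
    return 100.0 * total / len(predictions)


if __name__ == "__main__":
    preds = ["The Eiffel Tower!", "Rome"]
    golds = [["Eiffel Tower"], ["Milan"]]
    print(em_score(preds, golds))
```

On this 0-100 scale, a gap of 1.9 EM between two models means one of them answers roughly two more questions per hundred exactly right.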

Abstract

Multimodal multihop question answering (MMQA) requires reasoning over images and text from multiple sources. Despite advances in visual question answering, this multihop setting remains underexplored due to a lack of quality datasets. Existing methods focus on single-hop, single-modality, or short texts, limiting real-world applications like interpreting educational documents with long, multimodal content. To fill this gap, we introduce FM2DS, the first framework for creating a high-quality dataset for MMQA. Our approach consists of a 5-stage pipeline that involves acquiring relevant multimodal documents from Wikipedia, synthetically generating high-level questions and answers, and validating them through rigorous criteria to ensure data quality. We evaluate our methodology by training models on our synthesized dataset and testing on two benchmarks: MultimodalQA and WebQA. Our results…

Code & Models

Repositories

servicenow/fm2ds (tf, Official)

Videos

FM2DS: Few-Shot Multimodal Multihop Data Synthesis with Knowledge Distillation for Question Answering

Taxonomy

Topics: Text and Document Classification Technologies

Methods: Focus