Piecing It All Together: Verifying Multi-Hop Multimodal Claims
Haoran Wang, Aman Rangapur, Xiongxiao Xu, Yueqing Liang, Haroon Gharwi, Carl Yang, Kai Shu

TL;DR
This paper introduces a challenging new task and dataset for multi-hop multimodal claim verification, which requires models to reason over diverse evidence sources such as text, images, and tables to verify claims.
Contribution
The paper presents the MMCV dataset with 15,000 multi-hop multimodal claims, generated with large language models and human feedback, and establishes a human performance benchmark.
Findings
State-of-the-art models struggle with multi-hop reasoning in MMCV
Increasing reasoning hops decreases model accuracy
Human performance on a subset of MMCV provides a benchmark for future models
Abstract
Existing claim verification datasets often do not require systems to perform complex reasoning or effectively interpret multimodal evidence. To address this, we introduce a new task: multi-hop multimodal claim verification. This task challenges models to reason over multiple pieces of evidence from diverse sources, including text, images, and tables, and determine whether the combined multimodal evidence supports or refutes a given claim. To study this task, we construct MMCV, a large-scale dataset comprising 15k multi-hop claims paired with multimodal evidence, generated and refined using large language models, with additional input from human feedback. We show that MMCV is challenging even for the latest state-of-the-art multimodal large language models, especially as the number of reasoning hops increases. Additionally, we establish a human performance benchmark on a subset of MMCV.…
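The task described above can be pictured with a small sketch: a claim paired with several pieces of evidence of different modalities, a support/refute label, and a hop count, plus a helper that breaks accuracy down by number of hops. Everything here is an illustrative assumption rather than the MMCV schema or the authors' evaluation code: the class names `Evidence` and `Claim`, the field names, the label strings, and the function `accuracy_by_hops` are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Literal

# Hypothetical types; the real MMCV format may differ.
Modality = Literal["text", "image", "table"]
Label = Literal["SUPPORT", "REFUTE"]


@dataclass
class Evidence:
    modality: Modality
    content: str  # raw text, an image path, or a serialized table


@dataclass
class Claim:
    text: str
    evidence: list[Evidence]
    label: Label
    hops: int  # number of reasoning hops needed to verify the claim


def accuracy_by_hops(claims, predictions):
    """Return {hop_count: accuracy} for predictions aligned with claims."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for claim, pred in zip(claims, predictions):
        total[claim.hops] += 1
        correct[claim.hops] += int(pred == claim.label)
    return {h: correct[h] / total[h] for h in sorted(total)}
```

Grouping accuracy by hop count in this way is one simple route to checking whether performance drops as the number of reasoning hops grows, which is the trend the abstract reports.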
Taxonomy
Topics: Natural Language Processing Techniques
