TL;DR
This paper introduces the first benchmark for extracting claims from multimodal social media posts, evaluating current models and proposing a new intent-aware framework to improve claim extraction accuracy.
Contribution
It presents a novel benchmark dataset for multimodal claim extraction and introduces MICE, a framework that enhances model performance on intent-critical cases.
Findings
State-of-the-art multimodal LLMs struggle with rhetorical intent and context.
MICE improves claim extraction in intent-critical scenarios.
The benchmark dataset reflects real-world social media misinformation, with gold-standard claims derived from real-world fact-checkers.
Abstract
Automated Fact-Checking (AFC) relies on claim extraction as a first step, yet existing methods largely overlook the multimodal nature of today's misinformation. Social media posts often combine short, informal text with images such as memes, screenshots, and photos, creating challenges that differ from both text-only claim extraction and well-studied multimodal tasks like image captioning or visual question answering. In this work, we present the first benchmark for multimodal claim extraction from social media, consisting of posts containing text and one or more images, annotated with gold-standard claims derived from real-world fact-checkers. We evaluate state-of-the-art multimodal LLMs (MLLMs) under a three-part evaluation framework (semantic alignment, faithfulness, and decontextualization) and find that baseline MLLMs struggle to model rhetorical intent and contextual cues. To…
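As a rough illustration of the setup the abstract describes, a benchmark item and its three evaluation dimensions could be represented as below. This is a minimal sketch, not the paper's actual schema: the class names, field names, and the assumption that each score lies in [0, 1] are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ClaimExtractionExample:
    """One benchmark item (hypothetical schema, not taken from the paper)."""
    post_text: str          # short, informal social media text
    image_paths: List[str]  # one or more images (memes, screenshots, photos)
    gold_claim: str         # gold-standard claim derived from fact-checkers

    def __post_init__(self) -> None:
        # The abstract says posts contain text and one or more images.
        if not self.image_paths:
            raise ValueError("each post needs at least one image")


@dataclass(frozen=True)
class EvaluationScores:
    """The three evaluation dimensions named in the abstract."""
    semantic_alignment: float
    faithfulness: float
    decontextualization: float

    def __post_init__(self) -> None:
        # Assumption: scores are normalized to [0, 1]; the source does not say.
        for name in ("semantic_alignment", "faithfulness", "decontextualization"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")
```

Keeping the three scores as separate fields, rather than collapsing them into one number, mirrors the abstract's framing of evaluation as three distinct parts.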