TL;DR
XFacta is a contemporary, real-world multimodal misinformation dataset and evaluation framework. It addresses the outdated or synthetic content of existing benchmarks and stays current through a semi-automatic updating process.
Contribution
The paper introduces XFacta, a contemporary dataset for multimodal misinformation detection, and provides a comprehensive evaluation of MLLM-based strategies with a semi-automatic updating framework.
Findings
MLLM-based detectors show varied performance across architectures
Existing benchmarks are outdated or synthetic, limiting real-world relevance
Continuous updating improves detection accuracy over time
Abstract
The rapid spread of multimodal misinformation on social media calls for more effective and robust detection methods. Recent advances leveraging multimodal large language models (MLLMs) have shown potential in addressing this challenge. However, it remains unclear exactly where the bottleneck of existing approaches lies (evidence retrieval vs. reasoning), hindering further advances in this field. On the dataset side, existing benchmarks either contain outdated events, leading to evaluation bias due to discrepancies with contemporary social media scenarios as MLLMs can simply memorize these events, or are artificially synthetic, failing to reflect real-world misinformation patterns. Additionally, the field lacks comprehensive analyses of MLLM-based model design strategies. To address these issues, we introduce XFacta, a contemporary, real-world dataset that is better suited for evaluating…
Peer Reviews
Decision·Submitted to ICLR 2026
1. The dataset’s contemporaneity (2024–2025) addresses memorization bias in current LLMs, ensuring fair evaluation beyond training knowledge leakage. 2. The authors use optimal transport alignment and topic matching to reduce distributional bias between real and fake samples, producing a credible and well-balanced corpus. 3. The dual-axis study (evidence retrieval vs. reasoning) provides a structured framework that clarifies where current MLLMs fail, representing a crucial step toward principled
1. The process and consistency of manual review and evidence annotation are described briefly; inter-annotator agreement metrics and annotation protocols are missing. 2. The authors emphasize contemporaneity but do not systematically quantify how model performance degrades on older vs. newer misinformation, which would strengthen their central claim. 3. The annotations and the retrieved evidence may share the same sources (e.g., journalists’ debunking pages). Without strict separation between tr
1 Introduces XFACTA, a novel dataset with real-world misinformation samples from X, avoiding temporal leakage and memorization bias common in outdated benchmarks. 2 Provides a clear disentanglement of the roles of evidence retrieval and reasoning in detection performance, offering actionable insights into model bottlenecks. 3 Proposes a semi-automatic detection-in-the-loop framework that continuously updates the dataset with new verified cases, ensuring long-term relevance and scalability.
1 While the dataset is curated with recent and real-world examples, its reliance on Twitter (X) as the primary source follows a well-established practice in prior work. The advancement over existing datasets is incremental. 2 The proposed semi-automatic detection-in-the-loop framework lacks a comprehensive methodological description. A clearer overview of the full detection pipeline, particularly with a schematic illustration or step-by-step workflow, would improve transparency and reproducibil
1. Evaluating MLLMs on their ability to detect misinformation is an extremely important problem. Specifically, the multimodal aspect of misinformation is a challenging problem to solve and will require excellent multimodal reasoning from the MLLM. 2. The dataset presented is likely to be useful for practitioners seeking to construct AI systems for flagging misinformation online.
1. There are several other similar benchmarks for evaluating the capabilities of MLLMs to detect misinformation (cited in the paper itself). The authors argue that the dataset consists of contemporary posts, unlike other benchmarks. In my opinion, this is insufficient because the presented benchmark may also become outdated within a couple of years. It would be better to construct a continuously evolving benchmark for this (similar to LiveBench/LiveCodeBench). 2. It would also be good to include the U
