TL;DR
This paper introduces a new adversarial framework leveraging multimodal large language models to generate and detect sophisticated, semantically coherent multimedia disinformation, addressing limitations of current detection methods.
Contribution
It proposes the MDSM dataset and the AMD framework with artifact-aware encoding and reasoning to improve detection of MLLM-driven multimedia deception.
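One of the peer reviews below notes that AMD outputs its manipulation decision and the tampered-region coordinates as a single textual answer rather than through separate detection heads. The following is a minimal sketch of how such an answer might be parsed. The answer format (`real`/`fake` followed by four `<loc_k>` tokens), the 1000-bin quantization, and the names `parse_answer` and `Prediction` are all assumptions made for illustration; the format AMD actually uses is not given here.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumption: coordinates are emitted as <loc_k> tokens with k in [0, NUM_BINS).
NUM_BINS = 1000
LOC_PATTERN = re.compile(r"<loc_(\d+)>")
LABEL_PATTERN = re.compile(r"\s*(real|fake)", re.IGNORECASE)


@dataclass
class Prediction:
    is_manipulated: bool
    # (x1, y1, x2, y2) in pixels, or None when no region is reported
    box: Optional[Tuple[float, float, float, float]]


def parse_answer(answer: str, width: int, height: int) -> Prediction:
    """Parse a hypothetical unified answer such as
    'fake <loc_100><loc_200><loc_500><loc_800>' into a label and a pixel box."""
    match = LABEL_PATTERN.match(answer)
    is_manipulated = bool(match) and match.group(1).lower() == "fake"

    bins = [int(b) for b in LOC_PATTERN.findall(answer)]
    if not is_manipulated or len(bins) < 4:
        return Prediction(is_manipulated, None)

    # Map bin indices back to pixel coordinates of the original image.
    x1, y1, x2, y2 = bins[:4]
    sx, sy = width / NUM_BINS, height / NUM_BINS
    return Prediction(True, (x1 * sx, y1 * sy, x2 * sx, y2 * sy))


if __name__ == "__main__":
    print(parse_answer("fake <loc_100><loc_200><loc_500><loc_800>", 1000, 500))
    print(parse_answer("real", 640, 480))
```

Emitting the label and the box in one string means a single decoding pass yields both detection and grounding, at the cost of needing a tolerant parser like the one sketched here.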
Findings
AMD achieves 88.18% accuracy in cross-domain tests
Superior generalization over existing methods
Effective detection of semantically coherent manipulations
Abstract
The detection and grounding of multimedia manipulation has emerged as a critical challenge in combating AI-generated disinformation. While existing methods have made progress in recent years, we identify two fundamental limitations in current approaches: (1) Underestimation of MLLM-driven deception risk: prevailing techniques primarily address rule-based text manipulations, yet fail to account for sophisticated misinformation synthesized by multimodal large language models (MLLMs) that can dynamically generate semantically coherent, contextually plausible yet deceptive narratives conditioned on manipulated images; (2) Unrealistic misalignment artifacts: currently focused scenarios rely on artificially misaligned content that lacks semantic coherence, rendering them easily detectable. To address these gaps holistically, we propose a new adversarial pipeline that leverages MLLMs to…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The paper clearly highlights the risks posed by semantically coherent forgeries generated by modern Multimodal Large Language Models (MLLMs).
2. It presents a large-scale, semantically aligned multimodal dataset, effectively filling a crucial gap in resources for studying MLLM-driven misinformation.
3. The proposed framework integrates Artifact Pre-perception Encoding (APE) and Manipulation-Oriented Reasoning (MOR), leveraging the reasoning capabilities of MLLMs to collaboratively analyze image-
1. The paper aims to tackle the challenge of forgery detection in semantically aligned image-text scenarios. However, its model design primarily focuses on visual forgeries while largely overlooking textual forgeries.
2. Limited interpretability: although some visualizations are provided, the paper does not clearly demonstrate which specific forgery cues the model captures, nor does it offer a human-understandable reasoning process behind its decisions.
3. The organization of Section 2.2 could be i
1. The motivation is clear. The paper explicitly argues that prior work assumes crude cross-modal inconsistency, which makes detection too easy because the text and image obviously disagree. By contrast, MDSM uses an MLLM to generate fluent, contextually aligned fake narratives that match the manipulated visual identity. This is a meaningful direction.
2. AMD outputs manipulation decisions and the tampered region coordinates as a single textual answer instead of separate detection heads. This m
1. The novelty claimed based on the flaw of previous works is not strong enough. Prior work like FKA-Owl is already an MLLM-style system that “incorporates more world knowledge to improve the model’s cross-domain performance,” explicitly targeting multimodal fake news scenarios. The paper acknowledges this but still claims current approaches “fail to account for sophisticated misinformation synthesized by MLLMs,” which is not persuasive enough.
2. On the model side, AMD is essentially Florence
1. The paper identifies and formalizes the "coherence trap," a highly relevant and critical issue in the era of advanced generative models, where the very coherence that makes AI-generated content useful also makes it dangerously deceptive.
2. The construction of MDSM is a significant contribution. Its scale, diversity of sources, and alignment between modalities make it a valuable resource for the research community.
3. The paper presents thorough experiments, including ablation studies and c
1. While the paper formally defines and highlights the "coherence trap" as a critical challenge in multimodal misinformation detection, the underlying concept of detecting semantically aligned fake content is not entirely novel. Prior works, such as MMFakeBench, have already explored scenarios involving coherent image-text manipulations.
2. The proposed AMD framework is built upon the powerful Florence-2 model and leverages its strong pre-trained multimodal understanding. The architectural inn
* I believe this paper is relevant and has a few strengths. There are some interesting experiments in both a zero-shot setting and with models trained on the MDSM dataset.
* The authors spent time on LoRA fine-tuning on their dataset, which I think was an important experiment to have included.
* Showcasing how other models prevalent in multi-modal manipulation detection, such as HAMMER and FKA-Owl, perform was good to have, along with some discussion of the analysis.
* Including a human eva
* I believe that the authors should include more open-source multimodal models for zero-shot evaluation in Table 2. Currently the only model present is Qwen, with no other popular models such as DeepSeek, LLaVA, or Yi-VL.

[1] DeepSeek LLM: Scaling Open-Source Language Models with Longtermism.
[2] Visual Instruction Tuning, NeurIPS 2023.
[3] Yi: Open Foundation Models by 01.AI, 2024.
