Exposing Text-Image Inconsistency Using Diffusion Models
Mingzhen Huang, Shan Jia, Zhou Zhou, Yan Ju, Jialing Cai, Siwei Lyu

TL;DR
This paper presents D-TIIL, a diffusion model-based method that localizes and explains text-image inconsistencies to combat misinformation, supported by a new dataset for detailed evaluation.
Contribution
Introduces D-TIIL, a novel diffusion model approach for explainable localization of text-image inconsistencies, and provides a new dataset for detailed assessment.
Findings
D-TIIL effectively localizes semantic inconsistencies.
The TIIL dataset enables word- and region-level evaluation.
D-TIIL outperforms existing methods in explainability.
Abstract
In the battle against widespread online misinformation, a growing problem is text-image inconsistency, where images are misleadingly paired with texts with different intent or meaning. Existing classification-based methods for text-image inconsistency can identify contextual inconsistencies but fail to provide explainable justifications for their decisions that humans can understand. Although more nuanced, human evaluation is impractical at scale and susceptible to errors. To address these limitations, this study introduces D-TIIL (Diffusion-based Text-Image Inconsistency Localization), which employs text-to-image diffusion models to localize semantic inconsistencies in text and image pairs. These models, trained on large-scale datasets, act as "omniscient" agents that filter out irrelevant information and incorporate background knowledge to identify inconsistencies. In addition, D-TIIL…
Peer Reviews
Decision: ICLR 2024 poster
1. The task studied in this paper is meaningful.
2. The dataset they collected is a useful contribution to the community.
3. The method is novel.

1. The writing is not very good. I spent several hours reading the methodology section to understand the pipeline.
2. The idea is well justified for inconsistencies in object alignment. But what if the predicate is not aligned, i.e. the person is correct but the action is not?

1. Originality: The paper introduces a novel method, D-TIIL, that exposes text-image inconsistency by locating the inconsistent image regions and words. Also, the new TIIL dataset is the first dataset with pixel-level and word-level inconsistency features that provide fine-grained and reliable inconsistency.
2. Quality: The D-TIIL method and the TIIL dataset generation are thoroughly described. The paper also provides a comprehensive comparison of the proposed method with existing approaches.
3. Cla

1. The paper acknowledges that D-TIIL may struggle with inconsistencies that depend on specific external knowledge, which could reduce the effectiveness of the method in real-world applications.
2. The D-TIIL method relies heavily on text-to-image diffusion models and benefits greatly from their already well-aligned semantic space. This dependence could limit the generalizability of the proposed method.
3. There are some confusing details in the method description section.
4. I

• The paper is well written and well structured
• The problem and the related work are well introduced
• The framework is explained in detail
• The idea to build consistency scores between stable diffusion and the original image is interesting.

• The general theoretical idea behind the approach lacks clarity
• The real-world application is not very clear, e.g. wrong labels involve other kinds of mislabeling than just swapped objects
• The choice of threshold strongly influences M and the consistency score

With D-TIIL, the authors have presented an interesting method for using diffusion models to evaluate the consistency of image-text pairs. However, the utility of the method is not fully evaluated in detail. Deeper insights
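
The reviewer's point about threshold sensitivity can be illustrated with a toy example. The sketch below assumes that M denotes a binary mask obtained by thresholding a per-pixel difference map, which this summary does not confirm. The difference values, the thresholds, and the helper names `binarize` and `mask_area_fraction` are all hypothetical, and the code is not the D-TIIL implementation.

```python
# Toy illustration only: the difference map and thresholds are made up,
# and this is not the D-TIIL implementation.

def binarize(diff_map, threshold):
    """Return a binary mask: 1 where the difference exceeds the threshold."""
    return [[1 if value > threshold else 0 for value in row] for row in diff_map]


def mask_area_fraction(mask):
    """Return the fraction of pixels flagged by the mask."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


# Hypothetical 4x4 per-pixel difference map with values in [0, 1].
diff_map = [
    [0.05, 0.10, 0.40, 0.45],
    [0.08, 0.35, 0.80, 0.90],
    [0.02, 0.30, 0.85, 0.95],
    [0.01, 0.06, 0.38, 0.42],
]

# Sweep a few thresholds and report how much of the image is flagged.
for threshold in (0.2, 0.5, 0.87):
    mask = binarize(diff_map, threshold)
    print(f"threshold={threshold}: flagged fraction={mask_area_fraction(mask)}")
```

In this toy map the flagged region shrinks from most of the image to a couple of pixels as the threshold rises, so any score computed over the masked region would shift with it.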
Taxonomy
Topics: Misinformation and Its Impacts · Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning
Methods: Diffusion
