Towards Explainable Bilingual Multimodal Misinformation Detection and Localization
Yiwei He, Zhenglin Huang, Haiquan Wen, Tianxiao Li, Yi Dong, Hao Fei, Baoyuan Wu, Guangliang Cheng

TL;DR
This paper presents BiMi, a novel bilingual multimodal framework for detecting and localizing misinformation in news content, incorporating cross-lingual and cross-modal analysis, with a new benchmark and interpretability enhancements.
Contribution
The paper introduces BiMi, a comprehensive bilingual multimodal misinformation detection system with a new large-scale benchmark and innovative explanation optimization using GRPO.
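For readers unfamiliar with GRPO (Group Relative Policy Optimization), its central step scores each sampled response relative to the other responses drawn for the same prompt. The following is a minimal sketch of that normalization only, under the commonly described formulation; it is not taken from the BiMi paper. The function name `group_relative_advantages`, the `eps` constant, the use of the population standard deviation, and the example reward values are all illustrative assumptions.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses sampled for the same prompt.

    Assumption: advantage = (reward - group mean) / (group std + eps),
    using the population standard deviation. Real implementations may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical reward-model scores for four explanations of one sample.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Responses scored above the group mean receive a positive advantage and are reinforced; those below receive a negative one. When every response in a group gets the same reward, all advantages are zero and the group contributes no learning signal.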
Findings
BiMi outperforms baselines by up to +8.9 in accuracy
Improves localization accuracy by +15.9
Improves explanation quality by +2.5 BERTScore
Abstract
The increasing realism of multimodal content has made misinformation more subtle and harder to detect, especially in news media where images are frequently paired with bilingual (e.g., Chinese-English) subtitles. Such content often includes localized image edits and cross-lingual inconsistencies that jointly distort meaning while remaining superficially plausible. We introduce BiMi, a bilingual multimodal framework that jointly performs region-level localization, cross-modal and cross-lingual consistency detection, and natural language explanation for misinformation analysis. To support generalization, BiMi integrates an online retrieval module that supplements model reasoning with up-to-date external context. We further release BiMiBench, a large-scale and comprehensive benchmark constructed by systematically editing real news images and subtitles, comprising 104,000 samples with…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
Innovative Multimodal Framework: The introduction of the BiMi framework effectively combines bilingual and multimodal aspects, allowing for more nuanced analysis of misinformation that involves both images and subtitles in different languages. Natural Language Explanations: The ability to generate natural language explanations makes the model’s decision-making process more transparent and interpretable, enhancing user trust and understanding of the model’s outputs. Comprehensive Dataset: The r
Dependence on Data Quality: The performance of the BiMi framework heavily relies on the quality and accuracy of the input data (i.e., images and subtitles). In cases of low-quality or misleading inputs, the system's effectiveness may diminish. Interpretability Beyond Explanations: While the model provides natural language explanations, the underlying decision-making process and how different modalities interact might still lack transparency, making deeper interpretability a challenge.
1. The focus on bilingual (Chinese-English) subtitles reflects real-world content on major platforms such as Bilibili and YouTube, addressing a gap in existing multimodal misinformation detection research. 2. The proposed three-stage training pipeline (domain alignment → SFT → GRPO) is methodologically sound and effectively integrates multiple objectives into a unified framework. 3. Unlike many models that only classify misinformation, BiMi explicitly generates interpretable, step-by-step natu
1. The bilingual (Chinese–English) subtitles in the dataset are exact translations of each other, with no divergent or culturally nuanced content across languages. This raises questions about the necessity of the bilingual setting, as similar functionality could be achieved by translating existing monolingual datasets. Consequently, the contribution of constructing a new dataset appears incremental. 2. Insufficient Evaluation of Generalization: The model’s performance is evaluated only on in-doma
1. The paper addresses an underexplored problem—how to make multimodal misinformation detection not only accurate but also explainable. 2. The adoption of GRPO for refining reasoning is novel and well-motivated. By optimizing explanation quality through a learned reward model, the approach introduces a meaningful alternative to conventional supervised or PPO-based fine-tuning.
1. Insufficient dataset statistics and analysis. The paper does not report clear dataset statistics—such as the ratio of real vs. fake samples, textual length distribution, image diversity (scene/object categories), or modality correlation scores. These statistics are critical for understanding the coverage and bias of the dataset used for both pretraining and fine-tuning. 2. Unclear source and validation of “Explanation” annotations. The paper states that explanations are used as supervision du
1. BiMi achieves strong performance gains over competitive baselines on BiMiBench, demonstrating the effectiveness of the proposed framework. 2. The benchmark is large-scale with fine-grained manipulation categories and bilingual subtitle inconsistencies, enabling challenging evaluation beyond prior datasets. 3. The paper is well written and easy to follow.
1. The manipulations are created by prompts and automatic translation, which may not fully reflect how misinformation appears in real news or social media, where edits and cross-language differences happen in much more diverse and unpredictable ways. 2. The paper claims retrieval helps generalization, but the only evaluation they provide is on 100 real posts, where retrieval only helped in 9 cases. For the remaining 91%, retrieval either did not help or did nothing. So the evidence is too limi
Taxonomy
Topics: Misinformation and Its Impacts · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)
