Fake-in-Facext: Towards Fine-Grained Explainable DeepFake Analysis
Lixiong Qin, Yang Zhang, Mei Wang, Jiani Hu, Weihong Deng, Weiran Xu

TL;DR
This paper introduces FiFa, a framework that enhances fine-grained, explainable DeepFake analysis by leveraging a new annotation pipeline, a dedicated task, and a multi-task model supporting detailed visual and textual forgery explanations.
Contribution
It proposes a novel data annotation method, a new Artifact-Grounding Explanation task, and a multi-task learning architecture for fine-grained, explainable DeepFake detection.
Findings
Outperforms strong baselines on the AGE task
Achieves state-of-the-art on existing XDFA datasets
Provides more reliable and detailed forgery explanations
Abstract
The advancement of Multimodal Large Language Models (MLLMs) has bridged the gap between vision and language tasks, enabling the implementation of Explainable DeepFake Analysis (XDFA). However, current methods suffer from a lack of fine-grained awareness: the description of artifacts in data annotation is unreliable and coarse-grained, and the models fail to support the output of connections between textual forgery explanations and the visual evidence of artifacts, as well as the input of queries for arbitrary facial regions. As a result, their responses are not sufficiently grounded in Face Visual Context (Facext). To address this limitation, we propose the Fake-in-Facext (FiFa) framework, with contributions focusing on data annotation and model construction. We first define a Facial Image Concept Tree (FICT) to divide facial images into fine-grained regional concepts, thereby obtaining…
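As a rough illustration of the concept-tree idea mentioned in the abstract, the sketch below models a hierarchy of facial regions and enumerates its finest-grained concepts. It is not the authors' FICT: the class `ConceptNode`, its methods, and every region name (`face`, `eyes`, `upper_lip`, and so on) are hypothetical placeholders chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class ConceptNode:
    """One node in a hypothetical tree of facial-region concepts."""
    name: str
    children: List["ConceptNode"] = field(default_factory=list)

    def leaves(self) -> Iterator["ConceptNode"]:
        """Yield the finest-grained concepts under this node, depth first."""
        if not self.children:
            yield self
        else:
            for child in self.children:
                yield from child.leaves()

    def path_to(self, target: str,
                prefix: Tuple[str, ...] = ()) -> Optional[Tuple[str, ...]]:
        """Return the root-to-target chain of concept names, or None."""
        path = prefix + (self.name,)
        if self.name == target:
            return path
        for child in self.children:
            found = child.path_to(target, path)
            if found is not None:
                return found
        return None


# Assumed region names, for illustration only (not taken from the paper).
face = ConceptNode("face", [
    ConceptNode("eyes", [ConceptNode("left_eye"), ConceptNode("right_eye")]),
    ConceptNode("nose"),
    ConceptNode("mouth", [ConceptNode("upper_lip"), ConceptNode("lower_lip")]),
])

leaf_names = [node.name for node in face.leaves()]
lip_path = face.path_to("upper_lip")
```

In a structure like this, a region query can be resolved to either a coarse concept (`mouth`) or a fine one (`upper_lip`), which is one plausible way to connect region-level questions to a fixed vocabulary of facial concepts.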
Peer Reviews
Decision · Submitted to ICLR 2026
- The paper makes a commendable attempt to advance DeepFake analysis beyond simple binary real vs. fake detection to fine-grained, explainable localization and description.
- The proposed FiFa-MLLM architecture is a thoughtful effort to create a unified, multi-task model that can handle diverse inputs like bounding box queries.
- FiFa is designed specifically for DeepFakes created using "attribute manipulation" techniques. The authors state that this is because other methods, like identity or expression swapping, create pixel-level changes that are not localized to the artifact, making their artifact detection method unreliable. This limits the training data for fine-grained explanation to a single class of forgery. In fact there are several methods in the literature that can detect and localize identity or expression
1. The paper leverages multimodal large language models for deepfake annotation and considers diverse tasks including bounding box-level queries, which advances the explainability of deepfake detection. The multi-granularity task design (image-level, region-level, and box-level) enables fine-grained forgery analysis.
2. The authors contribute a dataset with comprehensive annotations covering 11 different tasks, including novel artifact-grounding explanations that interleave textual descriptions
1. The paper only considers attribute manipulation techniques (primarily FaceApp) for creating fake samples, excluding other common deepfake types such as identity swapping, expression swapping, and entire face synthesis in the data annotation pipeline (FiFa-Annotator). This narrow focus on a single forgery method may hinder the model's generalization capability to diverse deepfake techniques encountered in real-world scenarios.
2. The data generation approach (using masks + large language mode
1. The work makes progress toward fine-grained explainability in deepfake analysis through textual and visual artifact reasoning.
2. The hierarchical concept tree and annotation pipeline are well-motivated and improve annotation precision.
3. The unified multimodal architecture is well designed and integrates multi-task learning without requiring multiple encoders.
4. Experimental results across multiple datasets demonstrate the framework’s effectiveness and data reliability
Major Weaknesses
1. The semantic consistency between generated textual explanations and segmentation masks is not deeply validated. While the multi-task decoders output both modalities, the linguistic and visual results may be unaligned.
2. The FiFa-Annotator pipeline heavily relies on GPT-4o and ChatGPT for generating explanations, which may introduce linguistic or conceptual bias. The authors claim reliability improvement via prior knowledge but do not quantify annotation quality beyond mode
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis
