TL;DR
Skyra is a multimodal large language model designed to detect AI-generated videos by identifying visual artifacts and providing human-understandable explanations, supported by a new large-scale dataset and benchmark.
Contribution
The paper introduces Skyra, a novel model that detects and explains AI-generated videos using grounded artifact reasoning, along with a large dataset and benchmark for evaluation.
Findings
Skyra outperforms existing detection methods on multiple benchmarks.
The model provides human-interpretable explanations for its detections.
The new ViF-CoT-4K dataset provides fine-grained, human-annotated artifact labels for supervised fine-tuning.
Abstract
The misuse of AI-driven video generation technologies has raised serious social concerns, highlighting the urgent need for reliable AI-generated video detectors. However, most existing methods are limited to binary classification and lack the necessary explanations for human interpretation. In this paper, we present Skyra, a specialized multimodal large language model (MLLM) that identifies human-perceivable visual artifacts in AI-generated videos and leverages them as grounded evidence for both detection and explanation. To support this objective, we construct ViF-CoT-4K for Supervised Fine-Tuning (SFT), which represents the first large-scale AI-generated video artifact dataset with fine-grained human annotations. We then develop a two-stage training strategy that systematically enhances our model's spatio-temporal artifact perception, explanation capability, and detection accuracy. To…
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis
