DualFact+: A Multimodal Fact Verification Framework for Procedural Video Understanding
Cennet Oguz, Yasser Hamidullah, Josef van Genabith, Simon Ostermann

TL;DR
DualFact is a multimodal framework that evaluates factual accuracy in procedural video captioning by separating conceptual and contextual facts, improving alignment with human judgments.
Contribution
DualFact introduces a dual-layer evaluation method with implicit argument augmentation and contrastive fact sets, enhancing factual verification for multimodal video captioning.
Findings
DualFact correlates more strongly with human judgments than standard metrics.
State-of-the-art multimodal language models produce fluent but often factually incomplete captions, with systematic omissions and role-level inconsistencies.
Video-grounded verification reduces overestimation of hallucinations.
Abstract
We introduce DualFact, a dual-layer, multimodal factuality evaluation framework for procedural video captioning. DualFact separates factual correctness into conceptual facts, capturing abstract semantic roles (e.g., Action, Ingredient, Tool, Location), and contextual facts, capturing their grounded predicate-argument realizations in video. To support complete and role-consistent evaluation, DualFact incorporates implicit argument augmentation (VIA) and contrastive fact sets. We instantiate DualFact in two modes: DualFact-T, which verifies facts against textual evidence, and DualFact-V, which verifies facts against video-grounded visual evidence. Experiments on YouCook3-Fact and CraftBench-Fact show that state-of-the-art multimodal language models produce fluent but often factually incomplete captions, with systematic omissions and role-level inconsistencies. DualFact correlates more…
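To make the two layers concrete, here is a minimal sketch of how conceptual and contextual facts could be represented and compared. It is an illustration only, not the authors' implementation: the `Fact` class, the `conceptual_layer` and `overlap_scores` helpers, the example caption, and the use of exact-match set overlap are all assumptions introduced here. The abstract does not specify how verification is carried out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """Hypothetical fact record: one predicate-argument pair with its role."""
    predicate: str  # the verb the argument attaches to
    role: str       # conceptual layer, e.g. "Action", "Ingredient", "Tool"
    argument: str   # contextual layer: the grounded realization


def conceptual_layer(facts):
    """Roles that are present at all, ignoring how they are realized."""
    return {f.role for f in facts}


def overlap_scores(candidate, reference):
    """Precision and recall of candidate items against reference items.

    Exact-match overlap is an assumption made for this sketch.
    """
    candidate, reference = set(candidate), set(reference)
    matched = candidate & reference
    precision = len(matched) / len(candidate) if candidate else 0.0
    recall = len(matched) / len(reference) if reference else 0.0
    return precision, recall


# Reference facts for a made-up step: "chop the onion with a knife on the cutting board"
reference = [
    Fact("chop", "Action", "chop"),
    Fact("chop", "Ingredient", "onion"),
    Fact("chop", "Tool", "knife"),
    Fact("chop", "Location", "cutting board"),
]

# Facts extracted from a made-up caption: "chop the onion with a spoon"
candidate = [
    Fact("chop", "Action", "chop"),
    Fact("chop", "Ingredient", "onion"),
    Fact("chop", "Tool", "spoon"),
]

# Conceptual layer: which roles does the caption cover?
conceptual = overlap_scores(conceptual_layer(candidate), conceptual_layer(reference))

# Contextual layer: are the covered roles realized correctly?
contextual = overlap_scores(candidate, reference)

print("conceptual (precision, recall):", conceptual)
print("contextual (precision, recall):", contextual)
```

In this toy case the caption omits the Location role, which lowers conceptual recall, and it names the wrong tool, which lowers contextual precision even though the Tool role itself is covered. Keeping the two layers separate is what lets an omission be distinguished from an incorrect realization.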
