FCMBench-Video: Benchmarking Document Video Intelligence
Runze Cui, Fangxin Shang, Yehui Yang, Qing Yang, Yanwu Xu, Tao Chen

TL;DR
FCMBench-Video is a comprehensive benchmark designed to evaluate document-video understanding capabilities in financial contexts, focusing on perception, reasoning, and robustness across diverse document types and conditions.
Contribution
The paper introduces a large-scale, realistic dataset and an evaluation framework for assessing Video-MLLMs on document perception, reasoning, and robustness in authenticity-sensitive applications.
Findings
Evaluations reveal that counting tasks are highly duration-sensitive.
Cross-Document Validation and Evidence-Grounded Selection assess higher-level evidence integration.
The benchmark effectively differentiates system capabilities and robustness.
Abstract
Document understanding is a critical capability in financial credit review, onboarding, and remote verification, where both decision accuracy and evidence traceability matter. Compared with static document images, document videos present a temporally redundant and sequentially unfolding evidence stream, require evidence integration across frames, and preserve acquisition-process cues relevant to authenticity-sensitive and anti-fraud review. We introduce FCMBench-Video, a benchmark for document-video intelligence that evaluates document perception, temporal grounding, and evidence-grounded reasoning under realistic capture conditions. For privacy-compliant yet realistic data at scale, we organize construction as an atomic-acquisition and composition workflow that records reusable single-document clips, applies controlled degradations, and assembles long-form multi-document videos with…
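The atomic-acquisition and composition workflow described in the abstract can be illustrated with a small sketch. This is not the authors' pipeline: the class names, the degradation labels ("blur", "glare"), and the document-type strings are all hypothetical, and the code models only the three steps the abstract names (reusable single-document clips, controlled degradations, assembly into a longer multi-document video). Clips are represented as metadata only, so no video decoding is involved.

```python
from dataclasses import dataclass, replace
import random


@dataclass(frozen=True)
class AtomicClip:
    """A reusable single-document clip (metadata only; hypothetical schema)."""
    doc_type: str            # e.g. "id_card"; label names are assumptions
    duration_s: float
    degradation: str = "none"


@dataclass(frozen=True)
class Segment:
    """One clip placed on the timeline of a composed video."""
    doc_type: str
    start_s: float
    end_s: float
    degradation: str


def degrade(clip: AtomicClip, kind: str) -> AtomicClip:
    """Tag a clip with a controlled degradation (stand-in for a real transform)."""
    return replace(clip, degradation=kind)


def compose(clips, rng, degradations=("none", "blur", "glare")):
    """Shuffle clips, degrade each one, and lay them end to end.

    Returns a list of Segments whose boundaries act as ground truth
    for temporal-grounding and counting questions.
    """
    order = list(clips)
    rng.shuffle(order)
    timeline, t = [], 0.0
    for clip in order:
        clip = degrade(clip, rng.choice(degradations))
        timeline.append(
            Segment(clip.doc_type, t, t + clip.duration_s, clip.degradation)
        )
        t += clip.duration_s
    return timeline


def count_documents(timeline, doc_type: str) -> int:
    """Ground-truth answer for a counting question over the composed video."""
    return sum(1 for seg in timeline if seg.doc_type == doc_type)


clips = [
    AtomicClip("id_card", 4.0),
    AtomicClip("bank_statement", 6.5),
    AtomicClip("id_card", 5.0),
]
timeline = compose(clips, random.Random(0))
```

Because each segment records its own start and end time, the same composed timeline can supply answers for counting and for temporal grounding without extra annotation. That is one plausible motivation for a compositional design, though the source does not spell out its annotation scheme.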
