VideoMind: An Omni-Modal Video Dataset with Intent Grounding for Deep-Cognitive Video Understanding

Baoyao Yang; Wanyun Li; Dixin Chen; Junxiang Chen; Wenbin Yao; Haifeng Lin

arXiv:2507.18552·cs.CV·July 25, 2025

TL;DR

VideoMind is a comprehensive, multi-layered video dataset with detailed annotations and intent expressions, designed to advance deep cognitive understanding and multi-modal analysis of videos.

Contribution

The paper introduces VideoMind, a novel large-scale dataset with intent annotations and a benchmark for deep video understanding using multi-modal and hierarchical descriptions.

Findings

01 Models achieve improved understanding of intent and context.

02 Hybrid-cognitive retrieval experiments demonstrate effective deep comprehension.

03 VideoMind enables fine-grained cross-modal alignment research.

Abstract

This paper introduces VideoMind, a video-centric omni-modal dataset designed for deep video content cognition and enhanced multi-modal feature representation. The dataset comprises 103K video samples (3K reserved for testing), each paired with audio and systematically detailed textual descriptions. Specifically, every video and its audio are described across three hierarchical layers (factual, abstract, and intent), progressing from surface to depth. The dataset contains over 22 million words, averaging ~225 words per sample. VideoMind's key distinction from existing datasets is its provision of intent expressions, which require contextual integration across the entire video and are not directly observable. These deep-cognitive expressions are generated using a Chain-of-Thought (CoT) approach, prompting the mLLM through step-by-step reasoning. Each description includes annotations for subject,…
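As a rough illustration of the three-layer structure described in the abstract, the sketch below models one sample as a small Python record and checks the stated dataset sizes with simple arithmetic. The class name `VideoMindSample`, its field names, the file paths, and the example texts are all hypothetical; the actual schema of the released dataset may differ.

```python
from dataclasses import dataclass


@dataclass
class VideoMindSample:
    """Hypothetical record layout; field names are illustrative only."""
    video_path: str
    audio_path: str
    factual: str   # surface layer: what is directly seen and heard
    abstract: str  # middle layer: higher-level summary of the content
    intent: str    # deepest layer: purpose that is not directly observable

    def layers(self):
        """Return the three descriptions ordered from surface to depth."""
        return [self.factual, self.abstract, self.intent]

    def word_count(self):
        """Count whitespace-separated words across all three layers."""
        return sum(len(text.split()) for text in self.layers())


# Made-up example content, far shorter than the ~225-word average.
sample = VideoMindSample(
    video_path="clip_0001.mp4",
    audio_path="clip_0001.wav",
    factual="A person pours water into a glass.",
    abstract="Someone prepares a drink at home.",
    intent="To show a simple daily routine.",
)

# Figures taken from the abstract: 103K samples, 3K reserved for testing.
TOTAL_SAMPLES = 103_000
TEST_SAMPLES = 3_000
NON_TEST_SAMPLES = TOTAL_SAMPLES - TEST_SAMPLES

# ~225 words per sample is consistent with "over 22 million words".
approx_total_words = TOTAL_SAMPLES * 225

print(sample.word_count(), NON_TEST_SAMPLES, approx_total_words)
```

The word-count check is only a sanity calculation on the numbers quoted in the abstract, not a measurement of the dataset itself.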

Code & Models

Datasets

DixinChen/VideoMind
dataset · 69 dl

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis