TL;DR
This paper introduces VHAKG, a multi-modal knowledge graph built from synchronized multi-view simulated videos of daily activities. It captures both event-level and frame-level information to support knowledge processing and the benchmarking of vision-language models.
Contribution
The paper presents a multi-modal knowledge graph (MMKG) constructed from synchronized multi-view simulated videos of daily activities. It combines event-centric knowledge with fine-grained frame-level details, such as bounding boxes, and comes with support tools for querying.
Findings
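The MMKG facilitates benchmarking of vision-language models by providing the vision-language datasets needed for a tailored task.
It records fine-grained frame-by-frame changes, such as bounding boxes within video frames.
Support tools are provided for querying the MMKG.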
Abstract
Multi-modal knowledge graphs (MMKGs), which ground various non-symbolic data (e.g., images and videos) into symbols, have attracted attention as resources enabling knowledge processing and machine learning across modalities. However, the construction of MMKGs for videos consisting of multiple events, such as daily activities, is still in the early stages. In this paper, we construct an MMKG based on synchronized multi-view simulated videos of daily activities. Besides representing the content of daily life videos as event-centric knowledge, our MMKG also includes frame-by-frame fine-grained changes, such as bounding boxes within video frames. In addition, we provide support tools for querying our MMKG. As an application example, we demonstrate that our MMKG facilitates benchmarking vision-language models by providing the necessary vision-language datasets for a tailored task.
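To make the idea of event-centric knowledge linked to frame-level bounding boxes more concrete, the following is a minimal sketch and not the paper's actual schema or tooling. Every identifier in it (`event1`, `hasFrame`, `bbox:cup1`, `match`, `boxes_for_action`) is hypothetical and chosen only for illustration, and the graph is held as plain Python tuples rather than in a real triple store.

```python
# Hypothetical identifiers throughout; none are taken from the VHAKG schema.
Triple = tuple[str, str, object]

triples: set[Triple] = {
    # Event-level knowledge: what happens in a video segment.
    ("event1", "type", "Event"),
    ("event1", "action", "grab"),
    ("event1", "mainObject", "cup1"),
    ("event1", "hasFrame", "frame10"),
    ("event1", "hasFrame", "frame11"),
    # Frame-level knowledge: where the object appears, as (x, y, width, height).
    ("frame10", "bbox:cup1", (120, 80, 40, 50)),
    ("frame11", "bbox:cup1", (124, 82, 40, 50)),
}


def match(s=None, p=None, o=None) -> list[Triple]:
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        (
            t
            for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        ),
        key=repr,  # mixed object types cannot be compared directly
    )


def boxes_for_action(action: str, obj: str) -> dict[str, tuple]:
    """Collect per-frame bounding boxes of obj for events with the given action."""
    result: dict[str, tuple] = {}
    for event, _, _ in match(p="action", o=action):
        # Skip events whose main object is not the one requested.
        if not match(s=event, p="mainObject", o=obj):
            continue
        for _, _, frame in match(s=event, p="hasFrame"):
            for _, _, box in match(s=frame, p=f"bbox:{obj}"):
                result[frame] = box
    return result


if __name__ == "__main__":
    print(boxes_for_action("grab", "cup1"))
```

A query such as `boxes_for_action("grab", "cup1")` walks from the event to its frames and then to the boxes in each frame, which is the kind of event-to-frame traversal a vision-language benchmark could use to pair a textual event description with visual regions.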
Taxonomy
Methods: Softmax · Attention Is All You Need
