A Unified Model for Video Understanding and Knowledge Embedding with Heterogeneous Knowledge Graph Dataset

Jiaxin Deng; Dong Shen; Haojie Pan; Xiangyu Wu; Ximan Liu; Gaofeng Meng; Fan Yang; Size Li; Ruiji Fu; Zhongyuan Wang

arXiv:2211.10624·cs.CV·April 4, 2023

TL;DR

This paper introduces a new heterogeneous dataset combining multi-modal video entities and common sense relations, and proposes an end-to-end model that integrates video understanding with knowledge graph embedding to improve retrieval and inference tasks.

Contribution

It creates a novel dataset for joint video understanding and knowledge embedding, and develops a unified model that enhances content retrieval and knowledge inference performance.

Findings

01 Knowledge-enhanced video embeddings improve retrieval accuracy.

02 The model outperforms traditional KGE methods on new inference tasks.

03 Joint optimization benefits both video understanding and knowledge embedding.

Abstract

Video understanding is an important task on short-video platforms, with wide application in video recommendation and classification. Most existing video understanding work focuses only on information that appears within the video content itself, including video frames, audio, and text. However, introducing common sense knowledge from an external Knowledge Graph (KG) is essential for video understanding when referring to content that is less directly relevant to the video. Owing to the lack of video knowledge graph datasets, work that integrates video understanding and KGs is rare. In this paper, we propose a heterogeneous dataset that contains multi-modal video entities and rich common sense relations. This dataset also provides multiple novel video inference tasks, such as the Video-Relation-Tag (VRT) and Video-Relation-Video (VRV) tasks. Furthermore,…
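To make the Video-Relation-Tag (VRT) task concrete, it can be read as link prediction over (video, relation, tag) triples: given a video and a relation, rank the candidate tags. The sketch below uses a TransE-style translation score, a standard knowledge graph embedding technique chosen here purely for illustration; the abstract shown here does not say which scoring function the authors use. The function names, the 2-d vectors, and the tag names are all hypothetical.

```python
import math

def transe_score(head, relation, tail):
    """TransE-style plausibility: negative L2 distance between
    (head + relation) and tail. Higher means more plausible."""
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))

def rank_tails(head, relation, candidates):
    """Return candidate names ordered from most to least plausible tail."""
    scored = [(transe_score(head, relation, vec), name) for name, vec in candidates.items()]
    return [name for _, name in sorted(scored, reverse=True)]

# Toy, hand-picked embeddings (hypothetical; not taken from the dataset).
video_vec = [1.0, 0.0]      # stands in for a learned video embedding
relation_vec = [0.0, 1.0]   # stands in for a learned relation embedding
tags = {
    "cooking": [1.0, 1.0],
    "football": [-1.0, 0.0],
    "travel": [2.0, 2.0],
}

# VRT-style query: (video, relation, ?) -> ranked candidate tags
ranking = rank_tails(video_vec, relation_vec, tags)
print(ranking)
```

Under the same assumption, a Video-Relation-Video (VRV) query would pass video embeddings as the candidates instead of tag embeddings. In a jointly trained model the video vector would come from the video encoder and would not be a fixed lookup entry.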

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition