Fine-grained Knowledge Graph-driven Video-Language Learning for Action Recognition

Rui Zhang; Yafen Lu; Pengli Ji; Junxiao Xue; Xiaoran Yan

arXiv:2407.14146 · cs.MM · July 22, 2024 · 1 citation

Open Access

TL;DR

This paper introduces KG-CLIP, a knowledge graph-guided contrastive learning framework that improves video action recognition by capturing fine-grained semantic relationships between actions and body movements; it is especially effective when training data are limited.

Contribution

The paper proposes a novel knowledge graph-driven contrastive learning approach that incorporates multi-grained action concepts into the CLIP model for improved fine-grained video understanding.

Findings

01 Outperforms baseline methods on Kinetics-TPS dataset.

02 Excels in action recognition with few sample frames.

03 Demonstrates strong data efficiency and learning capability.

Abstract

Recent work has framed video action recognition as a video-text matching problem, and several effective methods have been built on large-scale pre-trained vision-language models. However, these approaches operate primarily at a coarse-grained level and lack a detailed semantic understanding of action concepts, which could be gained by exploiting fine-grained semantic connections between actions and body movements. To address this gap, we propose KG-CLIP, a contrastive video-language learning framework guided by a knowledge graph, which incorporates structured information into the CLIP model in the video domain. Specifically, we construct a multi-modal knowledge graph composed of multi-grained concepts by parsing actions based on compositional learning. By implementing a triplet encoder and deviation compensation to adaptively optimize the margin in the entity distance function, our model…
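The two ideas named in the abstract, video-text matching and a margin over an entity distance function, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the mean pooling of frame embeddings, the temperature value, the TransE-style distance, and the fixed margin are all assumptions chosen for illustration. The paper describes an adaptively optimized margin, which is not modeled here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def mean_pool(frames):
    """Average per-frame embeddings into one video embedding (an assumed pooling choice)."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def match_video_to_text(frame_embeddings, text_embeddings, temperature=0.07):
    """CLIP-style matching: softmax over temperature-scaled cosine similarities
    between one pooled video embedding and each candidate action-text embedding."""
    video = mean_pool(frame_embeddings)
    logits = [cosine(video, t) / temperature for t in text_embeddings]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def triplet_margin_loss(head, relation, tail, corrupt_tail, margin=1.0):
    """Hinge loss over a TransE-style distance ||h + r - t|| (an assumed distance).
    The margin is fixed here; the paper adapts it, which this sketch does not do."""
    def dist(h, r, t):
        return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

    positive = dist(head, relation, tail)
    negative = dist(head, relation, corrupt_tail)
    return max(0.0, margin + positive - negative)


# Toy 2-D embeddings (hypothetical values, not taken from the paper).
frames = [[1.0, 0.0], [0.8, 0.2]]
texts = [[1.0, 0.0], [0.0, 1.0]]
probabilities = match_video_to_text(frames, texts)
loss = triplet_margin_loss([0.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], margin=2.0)
```

In this toy setup the first text embedding points in nearly the same direction as the pooled video embedding, so it receives most of the probability mass. The triplet loss is zero whenever the corrupted triplet is already farther away than the true one by at least the margin.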

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Digital Imaging for Blood Diseases