Action Class Relation Detection and Classification Across Multiple Video Datasets
Yuya Yoshikawa, Yutaro Shigeto, Masashi Shimbo, Akikazu Takeuchi

TL;DR
This paper introduces a unified model for detecting and classifying relations between action classes across multiple video datasets, leveraging language and visual data to enhance dataset augmentation for human action recognition.
Contribution
It proposes a novel approach to determine relations between action classes in different datasets using combined language and visual information, improving relation prediction accuracy.
Findings
Language-based relation prediction outperforms video-based methods.
Pre-trained neural models significantly enhance prediction accuracy.
Combining language and visual modalities improves relation detection in some cases.
Abstract
The Meta Video Dataset (MetaVD) provides annotated relations between action classes in major datasets for human action recognition in videos. Although these annotated relations enable dataset augmentation, the augmentation is applicable only to the datasets covered by MetaVD. For an external dataset to enjoy the same benefit, the relations between its action classes and those in MetaVD need to be determined. To address this issue, we consider two new machine learning tasks: action class relation detection and classification. We propose a unified model to predict relations between action classes, using language and visual information associated with classes. Experimental results show that (i) recent pre-trained neural network models for texts and videos contribute to high predictive performance, (ii) relation prediction based on action label texts is more accurate than that based on videos, and (iii) a…
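To make the two tasks named in the abstract concrete, the following is a minimal sketch and not the authors' model. It assumes each action class already has a text embedding and a video embedding, concatenates them for a pair of classes, and scores the pair with a single linear layer plus softmax. The label names in `RELATIONS`, the function names, and all toy numbers are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence

# Hypothetical label set for illustration; index 0 means "no relation".
RELATIONS = ["no_relation", "relation_a", "relation_b"]


def pair_features(text_a: Sequence[float], text_b: Sequence[float],
                  video_a: Sequence[float], video_b: Sequence[float]) -> List[float]:
    """Concatenate the text and video embeddings of two action classes."""
    return list(text_a) + list(text_b) + list(video_a) + list(video_b)


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_relation(features: Sequence[float],
                      weights: Sequence[Sequence[float]],
                      biases: Sequence[float]) -> List[float]:
    """Relation classification: linear scores per label, then softmax."""
    scores = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)


def detect_relation(probs: Sequence[float]) -> bool:
    """Relation detection: True when the most likely label is not 'no_relation'."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best != 0


# Toy embeddings: 2-dimensional text vectors and 1-dimensional video vectors.
feats = pair_features([1.0, 0.0], [1.0, 0.0], [0.5], [0.5])
# Toy weights: one row per label in RELATIONS (untrained, hand-picked values).
weights = [
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
]
biases = [0.0, 0.0, 0.0]

probs = classify_relation(feats, weights, biases)
has_relation = detect_relation(probs)
```

In this sketch, detection is derived from the classifier's output for simplicity. Dropping the video entries from `pair_features` would give a language-only variant, which is the kind of comparison the findings refer to.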
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications
