Contextual Explainable Video Representation: Human Perception-based Understanding
Khoa Vo, Kashu Yamazaki, Phong X. Nguyen, Phat Nguyen, Khoa Luu, Ngan Le

TL;DR
This paper introduces a human perception-inspired, explainable approach for extracting contextual video representations, improving understanding of actions and scenes by modeling actors, objects, and environment interactions.
Contribution
It proposes a novel, explainable video representation method based on human perception factors, enhancing the interpretability and effectiveness of video understanding tasks.
Findings
Improved performance in video captioning and action detection tasks.
Enhanced interpretability of video representations.
Demonstrated effectiveness of perception-based modeling.
Abstract
Video understanding is a growing field and a subject of intense research. It includes many tasks that require understanding both spatial and temporal information, e.g., action detection, action recognition, video captioning, and video retrieval. One of the most challenging problems in video understanding is feature extraction, i.e., extracting contextual visual representations from a given untrimmed video, which is difficult because of the long and complicated temporal structure of unconstrained videos. Unlike existing approaches, which apply a pre-trained backbone network as a black box to extract visual representations, our approach aims to extract the most contextual information with an explainable mechanism. As we observed, humans typically perceive a video through the interactions between three main factors, i.e., the actors, the relevant objects, and the surrounding environment. Therefore,…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
