Going Deeper with Semantics: Video Activity Interpretation using Semantic Contextualization
Sathyanarayanan N. Aakur, Fillipe DM de Souza, Sudeep Sarkar

TL;DR
This paper introduces an energy minimization framework that incorporates large-scale commonsense knowledge to improve video activity interpretation, reducing reliance on extensive annotated datasets and enhancing semantic understanding.
Contribution
The paper presents a novel framework that leverages commonsense knowledge bases like ConceptNet for semantic reasoning in video analysis, addressing data imbalance and complex scene understanding.
Findings
Outperforms state-of-the-art methods on three datasets.
Reduces training data requirements through semantic priors.
Handles complex semantic relationships effectively.
Abstract
A deeper understanding of video activities extends beyond recognition of underlying concepts such as actions and objects: constructing deep semantic representations requires reasoning about the semantic relationships among these concepts, often beyond what is directly observed in the data. To this end, we propose an energy minimization framework that leverages large-scale commonsense knowledge bases, such as ConceptNet, to provide contextual cues to establish semantic relationships among entities directly hypothesized from the video signal. We mathematically express this using the language of Grenander's canonical pattern generator theory. We show that the use of prior encoded commonsense knowledge alleviates the need for large annotated training datasets and helps tackle imbalance in training through prior knowledge. Using three different publicly available datasets - Charades, Microsoft…
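The general idea of combining detector confidences with commonsense priors in one energy function can be illustrated with a toy sketch. This is not the authors' formulation, which is expressed in Grenander's pattern theory. The concept labels, confidence values, relatedness scores, and the `energy` and `interpret` helpers below are all hypothetical, and the relatedness table only stands in for weights one might derive from a knowledge base such as ConceptNet.

```python
import itertools
import math

# Hypothetical detector confidences per slot (illustrative values only).
CANDIDATES = {
    "action": {"pour": 0.40, "throw": 0.45},
    "object": {"milk": 0.50, "ball": 0.30},
}

# Hypothetical relatedness scores standing in for commonsense knowledge.
RELATEDNESS = {
    ("pour", "milk"): 0.9,
    ("throw", "ball"): 0.8,
    ("pour", "ball"): 0.05,
    ("throw", "milk"): 0.1,
}


def energy(labels, weight=1.0):
    """Lower energy means a more plausible interpretation."""
    slots = list(CANDIDATES)
    # Data term: negative log-confidence of each hypothesized label.
    data_term = -sum(math.log(CANDIDATES[s][l]) for s, l in zip(slots, labels))
    # Prior term: reward label pairs that the knowledge table relates.
    prior_term = -sum(
        RELATEDNESS.get(pair, 0.0)
        for pair in itertools.combinations(labels, 2)
    )
    return data_term + weight * prior_term


def interpret(weight=1.0):
    """Exhaustively pick the label combination with minimum energy."""
    options = [list(CANDIDATES[s]) for s in CANDIDATES]
    return min(
        itertools.product(*options),
        key=lambda labels: energy(labels, weight),
    )


print(interpret(weight=0.0))  # detector confidences only
print(interpret(weight=1.0))  # confidences plus commonsense prior
```

In this toy setup the detector alone favors the implausible pairing of "throw" with "milk", while adding the prior term shifts the minimum to "pour" with "milk". Exhaustive search is used only because the example is tiny; a realistic label space would need an approximate inference procedure.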
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
