Knowledge-Guided Short-Context Action Anticipation in Human-Centric Videos

Sarthak Bhagat; Simon Stepputtis; Joseph Campbell; Katia Sycara

arXiv:2309.05943 · cs.CV · September 13, 2023 · 1 citation

TL;DR

This paper augments a transformer with a symbolic knowledge graph to improve long-term human action anticipation from short video segments, outperforming prior methods on the Breakfast and 50Salads benchmarks by up to 9%.

Contribution

The paper integrates a symbolic knowledge graph into a transformer by boosting certain aspects of its attention mechanism at run-time, which improves long-term action predictions made from short video clips.

Findings

01 Outperforms state-of-the-art methods for long-term action anticipation from short video context by up to 9% on the Breakfast and 50Salads benchmarks

02 A symbolic knowledge graph boosts certain aspects of the transformer's attention mechanism at run-time

03 Improves long-term action anticipation accuracy in human-centric videos

Abstract

This work focuses on anticipating long-term human actions, particularly using short video segments, which can speed up editing workflows through improved suggestions while fostering creativity by suggesting narratives. To this end, we imbue a transformer network with a symbolic knowledge graph for action anticipation in video segments by boosting certain aspects of the transformer's attention mechanism at run-time. Demonstrated on two benchmark datasets, Breakfast and 50Salads, our approach outperforms current state-of-the-art methods for long-term action anticipation using short video context by up to 9%.
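The abstract says the knowledge graph boosts certain aspects of the transformer's attention at run-time, but it does not give a formula. One common way to realize that general idea is to add a bias to the attention logits before the softmax. The sketch below illustrates only that generic technique and should not be read as the authors' method. The toy graph `KNOWLEDGE_GRAPH`, the helpers `kg_boost` and `boosted_attention`, and the strength parameter `alpha` are all hypothetical names introduced for illustration.

```python
import math

# Hypothetical toy knowledge graph: observed object -> actions it affords.
# These entries are illustrative assumptions, not taken from the paper.
KNOWLEDGE_GRAPH = {
    "knife": {"cut"},
    "pan": {"fry"},
}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kg_boost(observed_objects, token_actions):
    """Return 1.0 for tokens whose action is linked to an observed object, else 0.0."""
    relevant = set()
    for obj in observed_objects:
        relevant |= KNOWLEDGE_GRAPH.get(obj, set())
    return [1.0 if action in relevant else 0.0 for action in token_actions]


def boosted_attention(query, keys, boost, alpha=1.0):
    """Scaled dot-product attention weights for one query.

    Each key's logit receives an additive boost scaled by alpha
    (an assumed form of attention boosting, not the paper's).
    """
    scale = math.sqrt(len(query))
    logits = [
        sum(q * k for q, k in zip(query, key)) / scale + alpha * b
        for key, b in zip(keys, boost)
    ]
    return softmax(logits)


# Three candidate action tokens with identical keys, so any difference in
# attention comes only from the knowledge-graph boost.
token_actions = ["cut", "fry", "pour"]
query = [1.0, 0.0]
keys = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]

boost = kg_boost(["knife"], token_actions)
plain = boosted_attention(query, keys, boost, alpha=0.0)   # boost disabled
guided = boosted_attention(query, keys, boost, alpha=1.0)  # boost enabled
print(plain)
print(guided)
```

With `alpha=0.0` the three tokens receive equal attention. With `alpha=1.0` the token for "cut" receives more weight, because the toy graph links it to the observed "knife". Because the bias is applied only at inference and no weights change, this form is compatible with a run-time adjustment like the one the abstract describes.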


Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Analysis and Summarization
