How can objects help action recognition?

Xingyi Zhou; Anurag Arnab; Chen Sun; Cordelia Schmid

arXiv:2306.11726 · cs.CV · June 21, 2023 · 1 citation

TL;DR

This paper introduces object-guided token sampling and object-aware attention to enhance video action recognition, enabling models to use fewer tokens while maintaining or improving accuracy across multiple datasets.

Contribution

It presents novel object-guided token sampling and object-aware attention modules that improve efficiency and accuracy in video action recognition models.

Findings

01. Achieves comparable accuracy with 30-60% of the input tokens.

02. Improves accuracy by 0.6 to 4.2 points when using the same number of tokens as the baseline.

03. Outperforms strong baselines on multiple datasets.
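To give a sense of scale for the 30-60% token budgets, the sketch below counts tokens for a hypothetical clip and shows how the pairwise cost of global self-attention shrinks with the kept fraction. The clip shape (32 frames at 224x224) and tubelet size (2x16x16) are illustrative assumptions, not values taken from this page, and the quadratic estimate covers only the attention term of a standard transformer, not the paper's measured compute.

```python
def tokens_per_clip(frames: int, height: int, width: int,
                    tubelet_frames: int, patch: int) -> int:
    """Number of spatio-temporal tokens for non-overlapping tubelets."""
    return (frames // tubelet_frames) * (height // patch) * (width // patch)


def kept_tokens(total: int, keep_ratio: float) -> int:
    """Tokens remaining after keeping a given fraction of the input."""
    return round(total * keep_ratio)


def relative_attention_cost(keep_ratio: float) -> float:
    """Pairwise attention cost relative to the full sequence.

    Global self-attention compares every token with every other token,
    so this term scales with the square of the sequence length.
    """
    return keep_ratio ** 2


# Hypothetical clip: 32 frames of 224x224 pixels, 2x16x16 tubelets.
total = tokens_per_clip(32, 224, 224, tubelet_frames=2, patch=16)
for ratio in (0.3, 0.6):
    print(ratio, kept_tokens(total, ratio), relative_attention_cost(ratio))
```

Under these assumptions the clip has 3136 tokens, so a 30% budget keeps roughly 941 of them and the attention term drops to about 9% of its full-sequence cost.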

Abstract

Current state-of-the-art video models process a video clip as a long sequence of spatio-temporal tokens. However, they do not explicitly model objects or their interactions across the video, and instead process all the tokens in the video. In this paper, we investigate how we can use knowledge of objects to design better video models, namely to process fewer tokens and to improve recognition accuracy. This is in contrast to prior works, which either drop tokens at the cost of accuracy or increase accuracy whilst also increasing the computation required. First, we propose an object-guided token sampling strategy that enables us to retain a small fraction of the input tokens with minimal impact on accuracy. Second, we propose an object-aware attention module that enriches our feature representation with object information and improves overall accuracy. Our resulting framework achieves…
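The idea of object-guided token sampling can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes object boxes are already available from some detector, scores each patch token by how much of its area a box covers, and keeps a fixed fraction of the highest-scoring tokens. The function names, the coverage-based score, and the `keep_ratio` parameter are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple

# Hypothetical box format: (x0, y0, x1, y1) in pixel coordinates.
Box = Tuple[float, float, float, float]


def patch_object_scores(boxes: Sequence[Box],
                        image_size: Tuple[int, int],
                        patch: int) -> List[float]:
    """Score each patch by the largest fraction of its area covered by a box.

    Patches are listed in row-major order over a (height // patch) by
    (width // patch) grid.
    """
    height, width = image_size
    rows, cols = height // patch, width // patch
    scores = []
    for r in range(rows):
        for c in range(cols):
            px0, py0 = c * patch, r * patch
            px1, py1 = px0 + patch, py0 + patch
            best = 0.0
            for x0, y0, x1, y1 in boxes:
                # Width and height of the patch/box intersection.
                iw = max(0.0, min(px1, x1) - max(px0, x0))
                ih = max(0.0, min(py1, y1) - max(py0, y0))
                best = max(best, (iw * ih) / (patch * patch))
            scores.append(best)
    return scores


def sample_tokens(scores: Sequence[float], keep_ratio: float) -> List[int]:
    """Return the indices of the top `keep_ratio` fraction of tokens.

    Ties are broken by token index, and the result is sorted so the
    kept tokens stay in their original order.
    """
    k = max(1, round(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(ranked[:k])


# A 64x64 frame with 16x16 patches gives a 4x4 grid of 16 tokens.
# One box covers the four central patches exactly.
scores = patch_object_scores([(16, 16, 48, 48)], (64, 64), patch=16)
print(sample_tokens(scores, keep_ratio=0.25))  # [5, 6, 9, 10]
```

A real system would apply this per frame (or per tubelet) and would likely keep some background tokens as context; this sketch simply ranks tokens by object coverage.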

Code & Models

Repositories

google-research/scenic (JAX, official)

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Analysis and Summarization

Methods: Contrastive Language-Image Pre-training