COMPOSER: Compositional Reasoning of Group Activity in Videos with Keypoint-Only Modality

Honglu Zhou; Asim Kadav; Aviv Shamsian; Shijie Geng; Farley Lai; Long Zhao; Ting Liu; Mubbasir Kapadia; Hans Peter Graf

arXiv:2112.05892 · cs.CV · July 26, 2022

TL;DR

COMPOSER is a multiscale transformer model that performs compositional reasoning for group activity recognition using only keypoint data, which reduces scene bias and privacy concerns while improving accuracy by up to 5.4%.

Contribution

The paper introduces a multiscale transformer architecture that models group activities compositionally from keypoints, and it improves the multiscale representations through clustering and additional training techniques.

Findings

1. Improves accuracy by up to 5.4% on the evaluated datasets.
2. Reduces scene bias by using only the keypoint modality.
3. Demonstrates strong interpretability and performance on benchmark datasets.

Abstract

Group Activity Recognition detects the activity collectively performed by a group of actors, which requires compositional reasoning of actors and objects. We approach the task by modeling the video as tokens that represent the multi-scale semantic concepts in the video. We propose COMPOSER, a Multiscale Transformer based architecture that performs attention-based reasoning over tokens at each scale and learns group activity compositionally. In addition, prior works suffer from scene biases with privacy and ethical concerns. We only use the keypoint modality which reduces scene biases and prevents acquiring detailed visual data that may contain private or biased information of users. We improve the multiscale representations in COMPOSER by clustering the intermediate scale representations, while maintaining consistent cluster assignments between scales. Finally, we use techniques such as…
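The idea of representing a clip as tokens at several scales and attending over the tokens within each scale can be illustrated with a small sketch. This is not the authors' implementation: the three scale names, the use of simple averaging to build coarser tokens, and the identity query/key/value projections are all assumptions made for illustration, and the helper names (`build_scales`, `self_attention`) are hypothetical. A real model would learn the projections and use a richer aggregation.

```python
import math


def mean_vec(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys and values are the tokens themselves
    (identity projections); a trained model would learn these maps.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out


def build_scales(keypoints):
    """Build tokens at three illustrative scales.

    keypoints[t][p][j] is an (x, y) pair for frame t, person p, joint j.
    """
    T, P, J = len(keypoints), len(keypoints[0]), len(keypoints[0][0])
    # Keypoint scale: one token per (person, joint), its track flattened over time.
    kp = [
        [c for t in range(T) for c in keypoints[t][p][j]]
        for p in range(P)
        for j in range(J)
    ]
    # Person scale: average of that person's keypoint tokens (assumed aggregation).
    person = [mean_vec(kp[p * J:(p + 1) * J]) for p in range(P)]
    # Group scale: a single token averaging all person tokens (assumed aggregation).
    group = [mean_vec(person)]
    return {"keypoint": kp, "person": person, "group": group}


# Toy clip: 2 frames, 3 people, 2 joints each.
clip = [
    [[(0.0, 0.0), (1.0, 0.0)], [(0.0, 1.0), (1.0, 1.0)], [(2.0, 2.0), (3.0, 2.0)]],
    [[(0.1, 0.0), (1.1, 0.0)], [(0.0, 1.1), (1.0, 1.1)], [(2.1, 2.0), (3.1, 2.0)]],
]
scales = build_scales(clip)
# Attention is applied independently within each scale.
refined = {name: self_attention(tokens) for name, tokens in scales.items()}
```

Each scale keeps its own token count (six keypoint tokens, three person tokens, one group token in the toy clip), so the attention step mixes information only among tokens of the same granularity.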

Code & Models

Repositories

hongluzhou/composer (PyTorch, official)

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Analysis and Summarization

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dropout · Label Smoothing · Byte Pair Encoding · Softmax · Absolute Position Encodings · Adam · Position-Wise Feed-Forward Layer