Knowledge Augmented Relation Inference for Group Activity Recognition

Xianglong Lang; Zhuming Wang; Zun Li; Meng Tian; Ge Shi; Lifang Wu; Liang Wang

arXiv:2302.14350·cs.CV·March 2, 2023


TL;DR

This paper introduces a novel framework that leverages concretized knowledge to enhance relation inference and individual representations, significantly improving group activity recognition accuracy.

Contribution

It proposes a Knowledge Augmented Relation Inference framework that integrates visual and semantic knowledge for better group activity recognition.

Findings

01

Achieves competitive performance on public datasets.

02

Effectively utilizes knowledge to improve relation inference.

03

Enhances individual representations with knowledge integration.


Tables

Table I: Comparison with state-of-the-art methods on the Volleyball dataset.

Method	Backbone	Optical Flow	MCA
HDTM [5]	AlexNet		81.9
CERN [32]	Vgg16		83.3
StagNet [28]	Vgg16		89.3
Detector-Free [31]	Resnet-18		90.5
SSU [6]	Inception-v3		90.6
HiGCIN [37]	Resnet-18		91.4
AT [10]	I3D		91.4
PRL [30]	Vgg16		91.4
ARG [9]	Inception-v3		92.5
STBiP [13]	Inception-v3		93.3
STDIN [11]	Vgg16		93.6
Groupformer [12]	Inception-v3		94.1
Dual-AI [46]	Inception-v3		94.4
SBGAR [40]	Inception-v3	✓	66.9
CRM [29]	I3D	✓	93.0
AT [10]	I3D	✓	93.0
JLSG [39]	I3D	✓	93.1
MSCA-GNN [16]	I3D	✓	93.4
ERN [47]	R50-FPN+I3D	✓	94.1
Groupformer [12]	I3D	✓	94.9
Dual-AI [15]	Inception-v3	✓	95.4
Ours(RGB)	I3D		94.5
Ours(RGB+Flow)	I3D	✓	94.8

Table II: Comparison against state-of-the-art methods on the Volleyball dataset under limited training data (MCA in %, by training data ratio).

Method	5%	10%	25%	50%	100%
PCTDM [36]	53.6	67.4	81.5	88.5	90.3
AT [10]	54.8	67.7	84.2	88.0	90.0
HiGCIN [37]	35.5	55.5	71.2	79.7	91.4
ERN [39]	41.2	52.5	73.1	75.4	90.7
ARG [9]	69.4	80.2	87.9	90.1	92.3
STDIN [11]	58.3	71.7	84.1	89.9	93.1
Dual-AI [15]	76.2	85.5	89.7	92.7	94.4
Ours-Base	66.2	78.8	88.2	92.2	93.4
Ours	79.0	85.6	92.1	93.2	94.5

Table III: Comparison with state-of-the-art methods on the Collective Activity dataset.

Method	Backbone	MCA	MPCA
SBGAR [40]	Inception-v3	86.1	-
Recurrent [27]	Vgg16	-	89.4
PCTDM [35]	AlexNet	-	92.2
PRL [30]	Vgg16	-	93.8
CRM [29]	I3D	85.8	94.2
JLSG [39]	I3D	89.4	-
SPTS [17]	Vgg16	90.7	95.7
AT [10]	I3D	92.8	98.5
MSCA-GNN [16]	I3D	93.1	-
ERN [47]	R50-FPN+I3D	93.9	-
Groupformer [12]	I3D	94.7	-
Dual-AI [15]	Inception-v3	-	96.5
Ours(RGB)	I3D	92.8	98.5
Ours(RGB+Flow)	I3D	93.5	98.7

Table IV: Ablation studies of the introduced knowledge on the Volleyball dataset.

Model	MCA
Base	93.4
Base + Semantic	93.6
Base + Semantic + C-P Map	93.9
Base + Semantic + C-C Map	94.0
Base + Semantic + C-C Map + C-P Map	94.5

Equations

p^{cc}_{ij} = \frac{m_{ij}}{\sum_{i=1}^{K} \sum_{j=1}^{K} m_{ij}}

p^{cp}_{ixy} = \frac{b_{ixy}}{\sum_{x=1}^{H} \sum_{y=1}^{W} b_{ixy}}

A^{s}_{i} = \sigma\left(\frac{Y W^{Q}_{i} \cdot (Y W^{K}_{i})^{T}}{d}\right)

h^{s}_{i} = (A^{s}_{i} + P^{cc}) \cdot Y W^{V}_{i}

\overline{Y} = F^{s}([h^{s}_{1}, h^{s}_{2}, ..., h^{s}_{i}])

A^{o}_{i} = \sigma\left(\frac{\widetilde{X} W^{\widehat{Q}}_{i} \cdot (\widehat{Y} W^{\widehat{K}}_{i})^{T}}{d}\right)

h^{o}_{i} = (A^{o}_{i} + P^{cp}) \cdot \widehat{Y} W^{\widehat{V}}_{i}

O = F^{o}([h^{o}_{1}, h^{o}_{2}, ..., h^{o}_{i}])

L_{x} = L_{CE}(\hat{y}^{x}_{g}, y_{g}) + \lambda L_{CE}(\hat{y}^{x}_{a}, y_{a})

L_{o} = L_{CE}(\hat{y}^{o}_{g}, y_{g}) + \lambda L_{CE}(\hat{y}^{o}_{a}, y_{a})

L_{s} = L_{CE}(\hat{y}^{s}_{g}, y_{g})

L = L_{x} + L_{o} + L_{s}


Taxonomy

Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Context-Aware Activity Recognition Systems

Full text

Knowledge Augmented Relation Inference for Group Activity Recognition

Xianglong Lang

Zhuming Wang

Zun Li

Meng Tian

Ge Shi

Lifang Wu

Faculty of Information Technology, Beijing University of Technology, Beijing, China

Liang Wang

Institute of Automation, Chinese Academy of Sciences, Beijing, China

Abstract

Most existing group activity recognition methods construct spatial-temporal relations based merely on visual representations. Some methods introduce extra knowledge, such as action labels, to build semantic relations and use them to refine the visual representation. However, the knowledge they explore stays at the semantic level, which is insufficient for pursuing notable accuracy. In this paper, we propose to exploit knowledge concretization for group activity recognition, and develop a novel Knowledge Augmented Relation Inference framework that can effectively use the concretized knowledge to improve the individual representations. Specifically, the framework consists of a Visual Representation Module to extract individual appearance features, a Knowledge Augmented Semantic Relation Module to explore semantic representations of individual actions, and a Knowledge-Semantic-Visual Interaction Module to integrate visual and semantic information through the knowledge. Benefiting from these modules, the proposed framework can utilize knowledge to enhance the relation inference process and the individual representations, thus improving the performance of group activity recognition. Experimental results on two public datasets show that the proposed framework achieves competitive performance compared with state-of-the-art methods.

I Introduction

Group activity recognition is an important sub-task in the field of video understanding. It has wide application prospects in intelligent robots, security monitoring, and sports event analysis. Unlike action recognition, which focuses on a single individual [1, 2, 3, 4], group activity recognition needs to understand a scene with multiple individuals. This task is more challenging since it relies on understanding not only the actions of multiple individuals but also the relations among them in the scene. Therefore, both effective individual features and relation modeling are essential to group activity recognition.

Existing methods generally enhance the visual representation of individuals by introducing relation inference [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15] with a graph network or transformer. However, they build relations mainly based on the visual representations or locations of individuals, which are not completely consistent with the semantic-level individual relations in the group activity. Some methods [16, 17, 18, 19, 20] introduce extra knowledge, such as action labels, to build semantic relations. By introducing knowledge, these methods improve group activity recognition performance, but the knowledge they explore merely stays at the semantic level (i.e., individual action labels), which is insufficient for pursuing notable accuracy.

In fact, there is abundant knowledge in real group activity recognition scenarios. For example, in team sports, one particular group activity represents the execution of a specific tactic that reflects the correlation and distribution of the corresponding individual actions. As shown in Fig. 1, the r-spike activity in volleyball matches involves a “spiking” player in the offensive zone, several “waiting” players in the defensive zone, and several “blocking” players in the opposing offensive zone. Under this description, an r-spike activity will never involve a “setting” player. This fact provides a clue for distinguishing it from other activities such as r-set. Therefore, if the knowledge can be leveraged more fully, we have a good chance of improving the reliability of visual representation and interaction modeling, and hence of further improving recognition performance.

Nevertheless, knowledge is usually extracted from a large number of samples and is a highly generalized abstract representation. In contrast, the input samples are concrete. Therefore, it is critical to concretize abstract knowledge into the same space as the input samples. Although several methods utilize cross-model aggregation [16] or knowledge distillation [18] to concretize the semantic label as a latent feature vector that interacts with the visual representation, this manner of concretization can hardly leverage richer knowledge. In fact, a group activity is a comprehensive expression of a group of individual actions: it is correlated with the individual actions and their positions, and this correlation is implied in a large number of samples. Thus, it is possible to obtain a concrete representation of knowledge from a large number of training samples through statistics.

In this paper, we propose to concretize the abstract knowledge, such as tactics, into the action distribution in different group activities, which is further represented as Class-Class Distribution Map (C-C Map) and Class-Position Distribution Map (C-P Map). They present the correlation and distribution of individual actions. Furthermore, we propose a novel Knowledge Augmented Relation Inference Framework to construct interactions among individuals and use the above two maps to enhance the individual representations for group activity recognition. Specifically, we first design a Visual Representation Module to extract the individual appearance representations. Then we design a Semantic Relation Module to construct the correlation between different individual actions with the assistance of the C-C Map. After that, a Knowledge-Semantic-Visual Interaction Module is devised to integrate visual information and semantic information through a cross-modal interacting block, combining the C-P Map to perform the relation inference and improve the individual representation ability. Finally, the enhanced individual features, along with raw visual features, are utilized for activity recognition. We evaluate our method on the Volleyball dataset and the Collective Activity dataset, and the experimental results show that the proposed framework achieves competitive performance compared to the state-of-the-art methods.

The contributions of this paper are summarized below:

•

We propose the idea of knowledge concretization for group activity recognition: the knowledge in specific application scenarios, such as team sports or surveillance, is concretized as the Class-Class Distribution Map (C-C Map) or the Class-Position Distribution Map (C-P Map).

•

We propose a novel Knowledge Augmented Relation Inference framework which integrates visual representation and knowledge (i.e., action labels, C-C Map, C-P Map) in a unified relation inference architecture.

•

Experiments on two public datasets show that the proposed method achieves competitive results compared with state-of-the-art methods, and that the introduced knowledge also helps improve performance with limited training data.

II Related Work

Group activity recognition has been studied for over a decade, and many methods have been proposed. Early methods extract hand-crafted features and infer group behavior with probabilistic graphical models [21, 22, 23, 24, 25]. Recently, deep relation inference based methods have shown promising performance [26]. They can be broadly classified into visual representation based methods and visual-semantic representation based methods, according to the information they use.

Visual Representation based Methods. Visual representation based methods usually obtain an enhanced visual representation by introducing visual relation inference [5, 27, 6, 7, 28, 18, 29, 30, 13, 31, 15]. Some methods adopt an RNN or LSTM to explore the spatio-temporal relations of individuals in the scene [5, 32, 6, 28, 33, 34, 35, 36]. Alternatively, some researchers introduce attention mechanisms into relation inference [9, 37, 38, 11, 10, 39, 12, 31, 15] and improve the representation of visual features. Wu et al. [9] utilize a graph structure to construct the relations between actors in the scene and enhance their representation with a graph convolution network. Yan et al. [37] construct a cross-graph to explore the temporal dynamics and spatial interaction context. Yuan et al. [11] use a dynamic relation module and a dynamic walk module to build person-specific interactions, which can model spatial-temporal relations effectively. Gavrilyuk et al. [10] adopt a transformer architecture to model spatial-temporal relations among individuals and use a multi-modal information fusion strategy. Li et al. [12] propose a Clustered Spatial-Temporal Transformer to deeply explore the correlation of spatial and temporal context in a parallel manner. Yuan and Ni [13] encode the global contextual information into individual features and explore all pairwise interactions between individuals. Han et al. [15] propose a Dual-path Actor Interaction framework to learn complex actor relations in videos and further enhance individual representation by using an efficient self-supervised signal.

Visual-Semantic Representation based Methods. Visual-semantic representation based methods introduce semantic information to relation inference and improve the consistency of visual relations with the semantic-level individual relations in the group activity [40, 28, 17, 20, 19, 16]. Li et al. [40] propose a novel semantics based scheme that recognizes group activities based on the semantic meaning of video captions generated by LSTM. Qi et al. [28] introduce a semantic graph to explicitly describe the spatial content of the scene and employ a structural-RNN to incorporate it with the temporal factor. Liu et al. [16] directly utilize individual action labels to construct a semantic graph to refine visual representations. Tang et al. [17] adopt knowledge distillation to force individual visual representations to be consistent with semantic representations embedded from action labels.

The existing methods demonstrate that introducing extra knowledge helps improve the visual representation. However, the knowledge they explore is only a small part of the knowledge available in real application scenarios, and their way of using knowledge cannot adapt to more complicated knowledge. Unlike these methods, we utilize richer knowledge, such as tactical information, and introduce the concretized knowledge into the relation inference framework to improve the feature representation.

III Method

III-A Overall Architecture

Our framework is mainly composed of three modules: the Visual Representation Module for extracting the appearance representation of individuals, the Knowledge Augmented Semantic Relation Module for encoding semantic representations of individual actions, and the Knowledge-Semantic-Visual Interaction Module for aggregating visual and semantic information. As illustrated in Fig. 2, our framework first summarizes the individual actions and constructs a Class-Class Distribution Map (C-C Map) and a Class-Position Distribution Map (C-P Map). Then, it feeds the C-C Map into the Knowledge Augmented Semantic Relation Module to enhance the semantic representation. Afterward, it utilizes the semantic representation and the C-P Map to enhance the visual representation through the Knowledge-Semantic-Visual Interaction Module. Finally, it predicts the group activities using the enhanced visual representations.

III-B Knowledge Concretization

As discussed above, knowledge is an abstract semantic representation with a gap from training samples. Thus, prior to the training procedure, we concretize knowledge in a form that can be integrated with visual representation of samples. To be specific, we summarize the individual actions of samples in the training set to concretize the semantic representation of knowledge, and further construct a Class-Class Distribution Map (C-C Map) and a Class-Position Distribution Map (C-P Map). These two maps can present the correlation and distribution of individual actions in one particular group activity.

Class-Class Distribution Map. Given $K$ different individual action labels ${\mathbf{L}}=\{{l_{i}}\}_{i=1}^{K}$ in the training set, we count the total number of co-occurrences $m_{ij}$ of the $i$ -th action label $l_{i}$ and the $j$ -th action label $l_{j}$ . For example, if two “blocking” players and three “standing” players occur in an image, we record the co-occurrence count of “blocking” and “standing” in this image as 6 (two times three pairs). Then we add up their co-occurrence counts over all images to get the total co-occurrence count of “blocking” and “standing”. In this way, we construct the Class-Class Distribution Map $\mathbf{P^{cc}}\in{\mathbb{R}^{K\times K}}$ , which measures the degree of correlation among individual action labels, as follows:

$$p^{cc}_{ij}=\frac{m_{ij}}{\sum_{i=1}^{K}\sum_{j=1}^{K}m_{ij}}$$

where $p^{cc}_{ij}\in\mathbf{P^{cc}}$ corresponds to the correlation of $i$ -th and $j$ -th label. This value reflects the probability of the simultaneous occurrence of different actions in a specific scenario.
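The counting and normalization described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the input format (`frames`, one list of integer label indices per annotated frame) and the function name `class_class_map` are hypothetical, and counting same-class pairs as $n_i \cdot n_i$ is an assumption, since the text only specifies the cross-class case.

```python
import numpy as np

def class_class_map(frames, num_classes):
    """Return (P_cc, m): the normalized map and the raw co-occurrence counts.

    frames: list of per-frame lists of integer action labels (hypothetical format).
    Same-class pairs (i == j) are counted as n_i * n_i, which is an assumption.
    """
    m = np.zeros((num_classes, num_classes), dtype=np.float64)
    for labels in frames:
        counts = np.bincount(np.asarray(labels, dtype=np.int64),
                             minlength=num_classes).astype(np.float64)
        # Per frame, m_ij grows by (number of label i) * (number of label j).
        m += np.outer(counts, counts)
    total = m.sum()
    p_cc = m / total if total > 0 else m.copy()
    return p_cc, m

# The example from the text: two "blocking" and three "standing" players.
BLOCKING, STANDING = 0, 1
frames = [[BLOCKING, BLOCKING, STANDING, STANDING, STANDING]]
p_cc, m = class_class_map(frames, num_classes=2)
print(m[BLOCKING, STANDING])  # 6.0
```

Because the denominator sums over all $i$ and $j$, the whole map sums to one rather than each row, so the entries behave like a joint distribution over label pairs.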

Class-Position Distribution Map. On the middle frame of every video clip in the training set, for the $i$ -th individual action, we mark the coordinate $(x,y)$ of each individual who performs it. Then we project these coordinates from all video clips onto one single image $b_{i}$ , which has the same size as the input frames. In this way, we obtain the distribution maps of the $K$ individual action labels. Similar to the Class-Class Distribution Map, we construct the Class-Position Distribution Map $\mathbf{P^{cp}}\in{\mathbb{R}^{H\times W\times K}}$ , which represents the distribution of individual actions, as follows:

$$p^{cp}_{ixy}=\frac{b_{ixy}}{\sum_{x=1}^{H}\sum_{y=1}^{W}b_{ixy}}$$

where $p^{cp}_{ixy}\in\mathbf{P^{cp}}$ denotes the value of $\mathbf{P^{cp}}$ for the $i$ -th individual action at the coordinate $(x,y)$ , and $b_{ixy}$ denotes the value of $b_{i}$ at the coordinate $(x,y)$ . The value $p^{cp}_{ixy}$ reflects the occurrence probability of each individual action at a specific spatial location.
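A minimal NumPy sketch of this construction follows. It rests on stated assumptions rather than the authors' code: annotations are given as hypothetical `(label, x, y)` triples, each individual contributes a single point (for instance a box center, which the text does not specify), and, following the summation limits above, `x` indexes the $H$ axis and `y` the $W$ axis.

```python
import numpy as np

def class_position_map(annotations, num_classes, height, width):
    """Return P_cp with shape (H, W, K).

    annotations: iterable of (label, x, y) with 0 <= x < H and 0 <= y < W
    (hypothetical format; one point per annotated individual).
    """
    b = np.zeros((num_classes, height, width), dtype=np.float64)
    for label, x, y in annotations:
        b[label, x, y] += 1.0          # project every occurrence onto b_i
    totals = b.sum(axis=(1, 2), keepdims=True)
    totals[totals == 0] = 1.0          # classes never observed stay all-zero
    return np.transpose(b / totals, (1, 2, 0))

annotations = [(0, 1, 2), (0, 1, 2), (0, 3, 0), (1, 0, 0)]
p_cp = class_position_map(annotations, num_classes=3, height=4, width=5)
```

Each class slice `p_cp[:, :, i]` sums to one when the class appears at least once, so it can be read as a spatial distribution for that action.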

III-C Visual Representation Module

The Visual Representation Module aims to extract the appearance features of individuals. As shown in Fig. 2, given a video clip of $T$ frames, we adopt an Inflated 3D ConvNet (I3D) [4] pre-trained on the Kinetics dataset [41] as the backbone to extract image appearance features, and employ a two-dimensional positional encoding (PE) to provide position information as in [42, 43]. In this way, we obtain the raw visual representation $\mathbf{X}\in\mathbb{R}^{H\times W\times C}$ , where $C$ is the number of channels. $\mathbf{X}$ also represents the global scene information directly extracted from the input frame. Then, the RoIAlign [44] operation is applied to extract the refined visual features of individuals from $\mathbf{X}$ according to the body part bounding boxes of each individual in the scene. After that, we utilize a fully-connected layer and the ReLU [45] activation function to encode the extracted features as $D$ -dimensional feature vectors $\mathbf{\overline{X}}\in\mathbb{R}^{N\times P\times D}$ , where $N$ is the number of actors in the scene and $P$ is the number of body parts. $\mathbf{\overline{X}}$ is the visual representation of individuals and is further used for individual relation inference in a later module.

III-D Knowledge Augmented Semantic Relation Module

In this module, we first encode the $K$ different individual action labels $\mathbf{L}$ into one-hot vectors and embed them into a $D$ -dimensional latent space to obtain the semantic features $\mathbf{Y}\in{\mathbb{R}^{K\times D}}$ . After that, we propose a Semantic Transformer to infer the relations among the semantic features of different individual actions, and introduce the C-C Map into the multi-head self-attention mechanism of the standard transformer. In the conventional self-attention operation, the output is computed as a weighted sum of values ( $V$ ), where the weights are computed by a correlation function of queries ( $Q$ ) and keys ( $K$ ). As shown in Fig. 3, to better explore the correlation of different semantic features, we add the C-C Map to the attention weights before the weighted sum operation. In this way, we can use the real data distribution to facilitate the relation modeling. The output of the multi-head self-attention mechanism $\mathbf{\overline{Y}}$ can be formulated as:

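$$A^{s}_{i}=\sigma\left(\frac{\mathbf{Y}W^{Q}_{i}\cdot(\mathbf{Y}W^{K}_{i})^{T}}{d}\right)$$

$$h^{s}_{i}=(A^{s}_{i}+\mathbf{P^{cc}})\cdot\mathbf{Y}W^{V}_{i}$$

$$\mathbf{\overline{Y}}=\mathrm{F}^{s}([h^{s}_{1},h^{s}_{2},...,h^{s}_{i}])$$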

where $W^{Q}_{i},W^{K}_{i},W^{V}_{i}$ are learnable matrices of dimension $D\times d$ , and $i$ is the number of attention heads. $\sigma$ denotes the softmax operation, $[,]$ denotes the concatenation operation, and $\mathrm{F}^{s}$ refers to the fully-connected layer adopted to integrate the outputs of multiple attention heads.

The residual connection operation and a feed-forward network are adopted to enhance the feature representation, and the final output of the Semantic Transformer is denoted as $\mathbf{\widehat{Y}}\in\mathbb{R}^{K\times D}$ .
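To make the role of the C-C Map in the attention step concrete, here is a minimal NumPy sketch of the multi-head self-attention described in this section, before the residual connection and feed-forward network. It is a sketch under assumptions, not the authors' code: the function and variable names are hypothetical, the weights are random placeholders, $\mathrm{F}^{s}$ is modeled as a bias-free linear map `W_out`, and the scores are divided by `d` as the formula is written here (standard transformers divide by the square root of `d`).

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def semantic_attention(Y, P_cc, W_q, W_k, W_v, W_out):
    """Y: (K, D); P_cc: (K, K); W_q/W_k/W_v: lists of (D, d); W_out: (heads*d, D)."""
    d = W_q[0].shape[1]
    heads = []
    for Wq, Wk, Wv in zip(W_q, W_k, W_v):
        A = softmax((Y @ Wq) @ (Y @ Wk).T / d)   # (K, K) attention weights
        heads.append((A + P_cc) @ (Y @ Wv))      # C-C Map added before the weighted sum
    return np.concatenate(heads, axis=-1) @ W_out

rng = np.random.default_rng(0)
K, D, d, n_heads = 3, 4, 2, 2
Y = rng.normal(size=(K, D))
P_cc = np.full((K, K), 1.0 / (K * K))            # placeholder map that sums to one
W_q = [rng.normal(size=(D, d)) for _ in range(n_heads)]
W_k = [rng.normal(size=(D, d)) for _ in range(n_heads)]
W_v = [rng.normal(size=(D, d)) for _ in range(n_heads)]
W_out = rng.normal(size=(n_heads * d, D))
Y_bar = semantic_attention(Y, P_cc, W_q, W_k, W_v, W_out)
```

Since the map is added after the softmax, its contribution to each head is the separate term $\mathbf{P^{cc}}\cdot\mathbf{Y}W^{V}_{i}$, a fixed statistical prior that the learned attention weights complement rather than rescale.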

III-E Knowledge-Semantic-Visual Interaction Module

This module enhances the individual representations by integrating visual and semantic representations with the assistance of the C-P Map, and performs the task of group activity recognition. We first employ a conventional vision transformer encoder [10] to enhance $\mathbf{\overline{X}}$ , the output of the Visual Representation Module, and obtain a refined individual representation $\mathbf{\widehat{X}}\in\mathbb{R}^{N\times P\times D}$ that has undergone preliminary relational inference; a multilayer perceptron then adjusts the shape of $\mathbf{\widehat{X}}$ to ${N\times D}$ . Then, we design a Visual-Semantic Inference Transformer to perform individual relation inference across the visual and semantic representations. Specifically, we feed $\mathbf{\widehat{X}}$ into the encoder of the transformer and output the encoded visual representation. The multi-head self-attention mechanism used in this encoder is similar to that in the Semantic Transformer introduced in Sec. III-D, except that the C-C Map is not added before the weighted sum operation. The sum of $\mathbf{\widehat{X}}$ and the output of the self-attention mechanism is denoted as $\mathbf{\widetilde{X}}\in\mathbb{R}^{N\times D}$ and further used in the subsequent process.

The decoder of the Visual-Semantic Inference Transformer takes the enhanced semantic feature representation $\mathbf{\widehat{Y}}$ , the encoded visual representation $\mathbf{\widetilde{X}}$ and the C-P Map as input and outputs the knowledge augmented individual features $\mathbf{\overline{O}}\in{\mathbb{R}^{N\times D}}$ . In addition, to better match the individuals with the distribution of their actions, we first select the corresponding $N$ position coordinates from the C-P Map according to the bounding boxes of the $N$ individuals in the frame, which converts the dimension of the C-P Map from $H\times W\times K$ to $N\times K$ . As shown in Fig. 4, we use a multi-head cross-attention mechanism to perform semantic-visual interaction as follows:

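$$A^{o}_{i}=\sigma\left(\frac{\mathbf{\widetilde{X}}W^{\widehat{Q}}_{i}\cdot(\mathbf{\widehat{Y}}W^{\widehat{K}}_{i})^{T}}{d}\right)$$

$$h^{o}_{i}=(A^{o}_{i}+\mathbf{P^{cp}})\cdot\mathbf{\widehat{Y}}W^{\widehat{V}}_{i}$$

$$\mathbf{O}=\mathrm{F}^{o}([h^{o}_{1},h^{o}_{2},...,h^{o}_{i}])$$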

where ${W^{\widehat{Q}}_{i}},{W^{\widehat{K}}_{i}},{W^{\widehat{V}}_{i}}$ are the learnable matrices whose dimension is $D\times d$ , $i$ is the number of attention heads. $\mathrm{F}^{o}$ denotes the fully-connected layer adopted to integrate the outputs of multiple attention heads. In this way, we can take the distribution of individual actions to guide the representation of features.

The residual connection operation and a feed-forward network are adopted to enhance the feature representation, and output the knowledge augmented individual feature $\mathbf{\overline{O}}$ .
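The two steps of this decoder, selecting $N$ rows from the C-P Map and running cross-attention with the selected rows as an additive term, can be sketched for a single head as follows. This is an illustrative sketch under assumptions: the names are hypothetical, each individual is represented by one `(x, y)` position (how a position is derived from a bounding box is not specified in the text), `x` indexes the $H$ axis and `y` the $W$ axis, and random arrays stand in for learned features and weights.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gather_cp(P_cp, positions):
    """Select one K-vector per individual: (H, W, K) -> (N, K)."""
    xs, ys = np.asarray(positions, dtype=np.int64).T
    return P_cp[xs, ys, :]

def cross_attention_head(X_tilde, Y_hat, P_cp_n, W_q, W_k, W_v):
    """X_tilde: (N, D) visual queries; Y_hat: (K, D) semantic keys and values."""
    d = W_q.shape[1]
    A = softmax((X_tilde @ W_q) @ (Y_hat @ W_k).T / d)   # (N, K)
    return (A + P_cp_n) @ (Y_hat @ W_v)                  # (N, d)

rng = np.random.default_rng(0)
H, W, K, N, D, d = 6, 8, 3, 2, 4, 2
P_cp = rng.random(size=(H, W, K))          # placeholder for the real C-P Map
positions = [(1, 2), (5, 7)]               # one hypothetical position per individual
P_cp_n = gather_cp(P_cp, positions)
X_tilde = rng.normal(size=(N, D))
Y_hat = rng.normal(size=(K, D))
W_q, W_k, W_v = (rng.normal(size=(D, d)) for _ in range(3))
h_o = cross_attention_head(X_tilde, Y_hat, P_cp_n, W_q, W_k, W_v)
```

The shapes line up because both the attention weights and the gathered map are $N\times K$: each individual attends over the $K$ action embeddings, and the map supplies a position-dependent prior over those same $K$ actions.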

III-F Training and Reasoning

Our framework is trained in an end-to-end manner. To supervise the feature representation learning, we design two classification heads to perform the individual action classification and group activity classification for both $\mathbf{\widehat{X}}$ and $\mathbf{\overline{O}}$ . We also utilize global scene information $\mathbf{X}$ to perform group activity classification. The classification losses $\mathcal{L}_{x}$ , $\mathcal{L}_{o}$ and $\mathcal{L}_{s}$ can be formulated as:

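$$\mathcal{L}_{x}=\mathcal{L}_{CE}(\mathbf{\hat{y}}^{x}_{\mathbf{g}},\mathbf{y}_{\mathbf{g}})+\lambda\mathcal{L}_{CE}(\mathbf{\hat{y}}^{x}_{a},\mathbf{y}_{a})$$

$$\mathcal{L}_{o}=\mathcal{L}_{CE}(\mathbf{\hat{y}}^{o}_{\mathbf{g}},\mathbf{y}_{\mathbf{g}})+\lambda\mathcal{L}_{CE}(\mathbf{\hat{y}}^{o}_{a},\mathbf{y}_{a})$$

$$\mathcal{L}_{s}=\mathcal{L}_{CE}(\mathbf{\hat{y}}^{s}_{\mathbf{g}},\mathbf{y}_{\mathbf{g}})$$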

where $\mathcal{L}_{CE}(\cdot)$ denotes the cross-entropy loss function. $\mathbf{y}_{a}$ and $\mathbf{y}_{\mathbf{g}}$ are the ground truth labels for individual actions and group activities, respectively. $\mathbf{\hat{y}}^{x}_{a}$ and $\mathbf{\hat{y}}^{o}_{a}$ represent the individual action scores predicted from $\mathbf{\widehat{X}}$ and $\mathbf{\overline{O}}$ , respectively. Similarly, $\mathbf{\hat{y}}^{x}_{\mathbf{g}}$ and $\mathbf{\hat{y}}^{o}_{\mathbf{g}}$ represent the group activity scores predicted from $\mathbf{\widehat{X}}$ and $\mathbf{\overline{O}}$ , respectively. $\mathbf{\hat{y}}^{s}_{\mathbf{g}}$ denotes the group activity scores predicted from $\mathbf{X}$ . $\lambda$ is the scalar weight that balances the different classification tasks. The overall loss function is formed as follows:

$$\mathcal{L}=\mathcal{L}_{x}+\mathcal{L}_{o}+\mathcal{L}_{s}$$

In the inference stage, we sum the group activity classification scores $\mathbf{\hat{y}}^{x}_{\mathbf{g}}$ , $\mathbf{\hat{y}}^{o}_{\mathbf{g}}$ and $\mathbf{\hat{y}}^{s}_{\mathbf{g}}$ as the final classification score.
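The training objective and the score fusion at inference can be sketched as below. This is a hedged illustration rather than the authors' code: the function names are hypothetical, the value of `lam` is a placeholder because the text does not give $\lambda$, cross-entropy is averaged over rows (an assumption), and the scores summed at inference are treated as raw classifier outputs.

```python
import numpy as np

def cross_entropy(scores, targets):
    """Mean cross-entropy of raw scores (rows) against integer targets."""
    scores = np.asarray(scores, dtype=np.float64)
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def total_loss(g_x, a_x, g_o, a_o, g_s, y_g, y_a, lam):
    """L = L_x + L_o + L_s with group (g_*) and action (a_*) scores."""
    L_x = cross_entropy(g_x, y_g) + lam * cross_entropy(a_x, y_a)
    L_o = cross_entropy(g_o, y_g) + lam * cross_entropy(a_o, y_a)
    L_s = cross_entropy(g_s, y_g)
    return L_x + L_o + L_s

def predict_group(g_x, g_o, g_s):
    """Sum the three group score vectors and take the arg max."""
    return np.argmax(np.asarray(g_x) + np.asarray(g_o) + np.asarray(g_s), axis=1)

# One clip with 3 group classes: the three heads disagree, the sum decides.
pred = predict_group([[0.0, 1.0, 0.0]], [[0.0, 0.0, 2.0]], [[0.0, 1.5, 0.0]])
print(pred)  # [1]
```

Summing scores lets a confident head outvote a weakly wrong one, which is the usual motivation for this kind of fusion of multiple classification heads.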

IV Experiments

IV-A Datasets

Volleyball dataset. The Volleyball dataset (VD) is one of the largest public datasets for the evaluation of group activity recognition. It has 3,493 and 1,337 video clips for training and testing, respectively. Moreover, it provides high-resolution video clips containing eight group activity categories: left spike, right spike, left set, right set, left pass, right pass, left winpoint, and right winpoint. In the middle frame of each video clip, all individuals are labeled with bounding box coordinates and individual action categories (blocking, digging, falling, jumping, moving, setting, passing, spiking, standing, and waiting). The image resolution for VD is 1280 $\times$ 720.

Collective Activity dataset. The Collective Activity dataset (CAD) is another high-quality public dataset for group activity recognition. It contains 5 group activity categories (walking, crossing, waiting, talking, and queuing). The middle frame of every ten frames in this dataset is labeled with bounding box coordinates and individual action categories (NA, walking, crossing, waiting, talking, and queuing). The group activity category is determined by the vast majority of individual categories in the scene. Image resolution for CAD is 720 $\times$ 480.

IV-B Implementation Details

For each video clip, we select ten frames (the middle frame, five frames before, and four frames after) as the input of the backbone network. We utilize the RoIAlign layer with crop size $7\times 7$ to obtain $\mathbf{\overline{X}}$ and embed the features into a $D=256$ dimensional space. For experiments on VD, we employ two attention heads in the encoder layer of the transformers and one attention head in the decoder. For experiments on CAD, we set the number of attention heads in the encoder of the transformers to 16. The dimension $d$ is set to 128 and the size of the fully-connected layer in the feed-forward network is set to 1024. For the VD, we utilize the Adam optimizer with $\beta_{1}=0.9,\beta_{2}=0.999$ and $\varepsilon=10^{-8}$ , chosen empirically. The batch size is set to 1, and the learning rate ranges from $1\times 10^{-4}$ to $1\times 10^{-6}$ . For the CAD, we set the Adam hyper-parameter $\varepsilon=10^{-10}$ , the batch size to 2, and the initial learning rate to $5\times 10^{-5}$ . The other settings are the same as on the Volleyball dataset. We adopt the widely used Multi-class Classification Accuracy (MCA) and Mean Per Class Accuracy (MPCA) as evaluation metrics. Our experiments are conducted on an NVIDIA GeForce RTX 2080 GPU with the PyTorch deep learning framework.
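For reference, the two metrics can be computed as in the following sketch. It assumes the usual reading of the names: MCA is the fraction of correctly classified samples, and MPCA is the unweighted mean of per-class accuracies over the classes present in the ground truth. The function names and the toy labels are hypothetical.

```python
import numpy as np

def mca(y_true, y_pred):
    """Multi-class Classification Accuracy: fraction of correct predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true == y_pred).mean())

def mpca(y_true, y_pred):
    """Mean Per Class Accuracy: average of per-class accuracies."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    per_class = [(y_pred[y_true == c] == c).mean() for c in np.unique(y_true)]
    return float(np.mean(per_class))

# Imbalanced toy case: class 0 is always right, class 1 is always wrong.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 0, 0]
print(mca(y_true, y_pred), mpca(y_true, y_pred))  # 0.75 0.5
```

The toy case shows why both are reported: MCA is dominated by frequent classes, while MPCA weights every class equally.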

IV-C Comparison with the State-of-the-Arts

Results on Volleyball dataset. We compare our framework with the state-of-the-art methods on VD and report the results in Tab. I. Our method (RGB only) reaches an MCA of 94.5%, achieving the best performance among all comparison methods [5, 33, 28, 32, 9, 29, 30, 47, 13, 16] that are implemented without optical flow. Moreover, we propose an extended version of the framework that uses a late fusion strategy, as in [10, 13], to fuse the RGB and Flow results. This version reaches an MCA of 94.8%. These results illustrate that our framework can achieve competitive performance compared with state-of-the-art methods.

More importantly, although the performance of our method is slightly lower than [15] in the above experiments, our framework achieves better results with limited training data. To demonstrate this, we conduct experiments on the VD with data ratios of 5%, 10%, 25%, and 50%. For a fair comparison, we select the same samples as [15]. The results of the comparison methods [36, 9, 10, 37, 39, 11, 15] are taken directly from [15]. Tab. II presents the experimental results. As can be seen, our method performs better than the state-of-the-art methods at data ratios of 5%, 10%, 25%, and 50%. In particular, it achieves MCAs of 79.0%, 85.6%, 92.1%, and 93.2%, surpassing the existing best results by 2.8%, 0.1%, 2.4%, and 0.5%, respectively. Furthermore, compared with the baseline model, which consists only of the Visual Representation Module and predicts the classification score simply from $\mathbf{\overline{X}}$ and $\mathbf{X}$ , our method significantly improves performance with knowledge under limited training data. In particular, when taking 5%, 10%, and 25% of the samples as training data, our method achieves an MCA higher than the baseline model by 12.8%, 6.8%, and 3.9%, respectively. This clearly demonstrates the effectiveness and superiority of our method under limited training data.

Results on Collective Activity dataset. On CAD, we use both MCA and MPCA for evaluation. Similar to other methods [29, 13, 11, 15], we merge the category “walking” and “crossing” as “moving” to calculate MPCA. In addition, since the scenario of this dataset changes dramatically, there is no certain correlation between action labels and individual positions. Therefore, the Class-Position Distribution Map (C-P Map) has not been used in this dataset.

Tab. III reports the experimental results. All the compared methods adopt optical flow. As can be seen from this table, our method achieves excellent results in terms of MPCA. Specifically, it reaches an MPCA of 98.5% with only RGB input, which equals the best compared result (AT [10]) and exceeds all the others. With optical flow inputs, it improves by a further 0.2% and achieves an MPCA of 98.7%. For MCA, our method is slightly lower than the existing best result. This is mainly because “walking” and “crossing” are highly similar in appearance, which confuses our model when obtaining the C-C Map and in turn impacts the recognition of these two categories.

IV-D Ablation Study

To investigate the effect of the introduced knowledge in the proposed framework, we conduct ablation studies on the VD with the following variants: (A) Base: it only consists of the Visual Representation Module and directly uses the individual representation $\mathbf{\overline{X}}$ and global scene information $\mathbf{X}$ for the final classification. (B) Base + Semantic: it removes both the Class-Class Distribution Map and Class-Position Distribution Map from the overall framework. (C) Base + Semantic + C-P Map: it removes the Class-Class Distribution Map from the whole framework. (D) Base + Semantic + C-C Map: it removes the Class-Position Distribution Map from the whole framework.

As shown in Tab. IV, the MCA of model (B) is 0.2% higher than that of model (A), which indicates the effectiveness of introducing semantic information. The MCA of models (C) and (D) is 0.3% and 0.4% higher than that of model (B), respectively. The complete framework, which introduces both distribution maps, reaches an MCA of 94.5% and outperforms ablation models (A), (B), (C), and (D) by 1.1%, 0.9%, 0.6%, and 0.5%, respectively. These results show the effectiveness of utilizing knowledge to enhance the relation inference process for group activity classification.
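The ablation variants can be summarized as three on/off switches, and the stated margins can be checked against one another. The sketch below is illustrative only: the `Variant` class and its field names are hypothetical, and the per-variant MCA values are back-computed from the margins quoted above rather than read from Tab. IV. Values are kept in tenths of a percent to avoid floating-point rounding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical description of one ablation setting."""
    name: str
    use_semantic: bool
    use_cp_map: bool   # Class-Position Distribution Map
    use_cc_map: bool   # Class-Class Distribution Map


VARIANTS = {
    "A": Variant("Base", False, False, False),
    "B": Variant("Base + Semantic", True, False, False),
    "C": Variant("Base + Semantic + C-P Map", True, True, False),
    "D": Variant("Base + Semantic + C-C Map", True, False, True),
    "Full": Variant("Complete framework", True, True, True),
}

# Tenths of a percent: 945 means 94.5%.
FULL_MCA = 945
GAP_TO_FULL = {"A": 11, "B": 9, "C": 6, "D": 5}

# MCA of each ablation model implied by its stated gap to the full model.
implied_mca = {key: FULL_MCA - gap for key, gap in GAP_TO_FULL.items()}
implied_mca["Full"] = FULL_MCA

# Pairwise improvements implied by those values.
b_over_a = implied_mca["B"] - implied_mca["A"]
c_over_b = implied_mca["C"] - implied_mca["B"]
d_over_b = implied_mca["D"] - implied_mca["B"]

for key, variant in VARIANTS.items():
    print(f"({key}) {variant.name}: implied MCA {implied_mca[key] / 10:.1f}%")
```

The implied pairwise differences (0.2%, 0.3%, and 0.4%) match the improvements quoted in the paragraph, so the reported margins are mutually consistent.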

IV-E Visualization

The t-SNE visualization of learned representation. We adopt t-SNE [48] to analyze the feature distribution of different models on the VD. As shown in Fig. 5 (a), the feature representations of “l-spike” cannot be separated well from those of “l-winpoint”. As shown in Fig. 5 (b) and (c), when the C-P Map or the C-C Map is introduced to enhance the relation inference process, our method distinguishes “l-spike” from “l-winpoint” well. As shown in Fig. 5 (d), our final framework separates the feature representations much better than the other variants. These results demonstrate the effectiveness of introducing knowledge into group activity recognition.

The visualization of predictions. An example of group activity recognition on the VD is visualized in Fig. 6. Compared with the ground truth in Fig. 6 (a), the model without the C-P Map and C-C Map misclassifies the actions of four players, as shown in Fig. 6 (b). In particular, classifying “setting” as “spiking” may cause the group activity to be misrecognized. The player in question appears to be jumping in Fig. 6 (b), which resembles “spiking”, so this misclassification is understandable for a model that classifies mainly from visual information. In comparison, by introducing the knowledge concretized in the C-C Map and C-P Map, for example that “setting” appears in the defensive zone with a higher probability than “spiking”, our model enhances the visual representation and correctly classifies the “setting” action. As a result, the group activity is correctly classified, as shown in Fig. 6 (c).

V Conclusion

In this paper, we observe that existing group activity recognition methods based on visual representation have not explored the influence of abundant knowledge on the relation modeling process, which limits their performance. We propose the idea of knowledge concretization and present an end-to-end group activity recognition framework. Our framework first uses a Visual Representation Module to extract appearance features, and then a Knowledge Augmented Semantic Relation Module to extract semantic information and explore the semantic relations. Finally, a Knowledge-Semantic-Visual module integrates visual and semantic information with the help of knowledge. Extensive experiments validate that knowledge effectively enhances the relation inference process and the individual representations. Benefiting from the design of these modules, our framework achieves competitive experimental results on two widely used datasets.

Bibliography

[1] L. Shi, Y. Zhang, J. Cheng, and H. Lu, “Skeleton-based action recognition with directed graph neural networks,” in Computer Vision and Pattern Recognition, 2019, pp. 7912–7921.
[2] C. Plizzari, M. Cannici, and M. Matteucci, “Skeleton-based action recognition via spatial and temporal transformer networks,” Computer Vision and Image Understanding, vol. 208, p. 103219, 2021.
[3] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in International Conference on Computer Vision, 2015, pp. 4489–4497.
[4] J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” in Computer Vision and Pattern Recognition, 2017, pp. 4724–4733.
[5] M. S. Ibrahim, S. Muralidharan, Z. Deng, A. Vahdat, and G. Mori, “A hierarchical deep temporal model for group activity recognition,” in Computer Vision and Pattern Recognition, 2016, pp. 1971–1980.
[6] T. Bagautdinov, A. Alahi, F. Fleuret, P. Fua, and S. Savarese, “Social scene understanding: End-to-end multi-person action localization and collective activity recognition,” in Computer Vision and Pattern Recognition, 2017, pp. 4315–4324.
[7] M. S. Ibrahim and G. Mori, “Hierarchical relational networks for group activity recognition and retrieval,” in European Conference on Computer Vision, 2018, pp. 721–736.
[8] X. Shu, L. Zhang, Y. Sun, and J. Tang, “Host–parasite: Graph lstm-in-lstm for group activity recognition,” IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 2, pp. 663–674, 2021.