Relation-Aware Pyramid Network (RapNet) for temporal action proposal

Jialin Gao; Zhixiang Shi; Jiani Li; Yufeng Yuan; Jiwei Li; Xi Zhou

arXiv:1908.03448 · cs.CV · August 12, 2019

TL;DR

This paper introduces RapNet, a relation-aware pyramid network for generating temporal action proposals in videos. The solution combines a fine-tuned CNN for snippet features, multi-scale proposal generation, boundary adjustment, and ensemble methods, and achieved 2nd place in the temporal action proposal task of the ActivityNet Challenge 2019.

Contribution

The paper presents a novel relation-aware pyramid network architecture for improved temporal action proposal generation in videos.

Findings

1. Achieved 2nd place in ActivityNet Challenge 2019.

2. Effective multi-scale proposal generation with boundary adjustment.

3. Enhanced performance through ensemble methods.

Abstract

In this technical report, we describe our solution to temporal action proposal (task 1) in ActivityNet Challenge 2019. First, we fine-tune a ResNet-50-C3D CNN on ActivityNet v1.3, starting from a Kinetics-pretrained model, to extract snippet-level video representations. We then design a Relation-Aware Pyramid Network (RapNet) to generate temporal multi-scale proposals with confidence scores. After that, we employ a two-stage snippet-level boundary adjustment scheme to re-rank the generated proposals. Ensemble methods are also used to improve the performance of our solution, which helps us achieve 2nd place.

Tables

Table 1: Our backbone ResNet-50 C3D model, when $k=6$. The input, with dimensions $L\times 224\times 224$, is downsampled along the temporal axis only at layer $maxpool_{2}$. $T\times H\times W$ denotes the time, height, and width dimensions of the filter kernels and output feature maps.

Layer name	Net architecture	Output size
$conv_{1}$	$1 \times 7 \times 7$, 64, s(1, 2, 2)	$L \times 112 \times 112$
$maxpool_{1}$	$1 \times 3 \times 3$, s(1, 2, 2)	$L \times 56 \times 56$
$res_{2}$	$\begin{matrix} 1 \times 1 \times 1 & 64 \\ 1 \times 3 \times 3 & 64 \\ 1 \times 1 \times 1 & 256 \end{matrix} \times 3$	$L \times 56 \times 56$
$maxpool_{2}$	$2 \times 1 \times 1$, s(2, 1, 1)	$L / 2 \times 56 \times 56$
$res_{3}$	$\begin{matrix} 1 \times 1 \times 1 & 128 \\ 1 \times 3 \times 3 & 128 \\ 1 \times 1 \times 1 & 512 \end{matrix} \times 4$	$L / 2 \times 28 \times 28$
$res_{4}$	$\begin{matrix} 3 \times 1 \times 1 & 256 \\ 1 \times 3 \times 3 & 256 \\ 1 \times 1 \times 1 & 1024 \end{matrix} \times k$	$L / 2 \times 14 \times 14$
$res_{5}$	$\begin{matrix} 3 \times 1 \times 1 & 512 \\ 1 \times 3 \times 3 & 512 \\ 1 \times 1 \times 1 & 2048 \end{matrix} \times 3$	$L / 2 \times 7 \times 7$
$pool_{3}$	global average pool	$1 \times 1 \times 1$

Table 2: Comparison of our RapNet with successive boundary adjustment schemes on the ActivityNet validation set in terms of AR@AN and AUC.

Method	AR@1	AR@5	AR@10	AR@100	AUC(val)
APG	34.70%	49.65%	57.23%	78.21%	69.61%
+PEM	35.21%	50.29%	58.17%	78.74%	70.35%
+TAG	35.30%	50.44%	58.31%	79.15%	70.65%

Equations

$$
\begin{aligned}
L_{prop} = {} & \lambda_{conf}\Big[\frac{1}{N_{pos}}\sum_{i=0}^{N-1}\sum_{j=0}^{T/2^{i}}\sum_{k=0}^{M}\Delta^{ins}_{ijk}\, f_{conf}(\hat{p}^{jk}_{conf}, p^{jk}_{conf}) \\
& \quad + \frac{1}{N_{neg}}\sum_{i=0}^{N-1}\sum_{j=0}^{T/2^{i}}\sum_{k=0}^{M}\Lambda^{ins}_{ijk}\, f_{conf}(\hat{p}^{jk}_{conf}, p^{jk}_{conf})\Big] \\
& + \lambda_{c}\frac{1}{N_{pos}}\sum_{i=0}^{N-1}\sum_{j=0}^{T/2^{i}}\sum_{k=0}^{M}\Delta^{ins}_{ijk}\, f_{c}(\hat{p}^{jk}_{c}, p^{jk}_{c}) \\
& + \lambda_{w}\frac{1}{N_{pos}}\sum_{i=0}^{N-1}\sum_{j=0}^{T/2^{i}}\sum_{k=0}^{M}\Delta^{ins}_{ijk}\, f_{w}(\hat{p}^{jk}_{w}, p^{jk}_{w}) \\
& + \lambda_{iou}\frac{1}{N_{pos}}\sum_{i=0}^{N-1}\sum_{j=0}^{T/2^{i}}\sum_{k=0}^{M}\Delta^{ins}_{ijk}\,(1 - iou_{jk})
\end{aligned}
$$

Taxonomy

Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Video Analysis and Summarization

Full text

Relation-Aware Pyramid Network (RapNet) for temporal action proposal: Submission to ActivityNet Challenge 2019

Jialin Gao

Cooperative Medianet Innovation Center

Shanghai Jiao Tong University

[email protected]

Zhixiang Shi, Jiani Li, Yufeng Yuan, Jiwei Li, Xi Zhou

CloudWalk Technology Co., Ltd

{shizhixiang, lijiani, yuanyufeng, lijiwei, zhouxi}@cloudwalk.cn

1 Video Representation

For the temporal action proposal task in ActivityNet Challenge 2019, we employ a two-stage strategy, which first encodes the visual content of a video and then generates proposals. In this section, we describe how we obtain the video representations used for proposal generation.

First, we extract RGB frames from the original video and then adopt a Residual C3D network [4, 8] to encode the visual context. To obtain compact features, we compose a snippet sequence $\Omega=\{\omega_{i}\}_{i=1}^{T^{\prime}}$ for a given video, where $T^{\prime}$ is the rescaled temporal length in snippets and each snippet $\omega_{i}$ contains $L$ frames. Our Residual C3D network takes these snippets as input and generates the encoded representation of the video.

The backbone C3D network is detailed in Tab. 1; we use both ResNet-50 and ResNet-101 for this task. The network is first trained on Kinetics [2], initialized with the ImageNet [3] pretrained weights provided in the PyTorch Model Zoo, where it reaches 71.48% top-1 accuracy, and is then fine-tuned on the ActivityNet v1.3 dataset. Finally, we add another fully connected layer to generate a 256-dimensional vector for each snippet, so that a video is represented as a $T^{\prime}\times 256$ feature map. Because videos differ in temporal length, we resize each feature map to a fixed temporal length, so that every video representation is $128\times 256$ for proposal generation.
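The resizing step can be illustrated with a minimal sketch. The report does not say which interpolation is used, so linear interpolation along the time axis is an assumption here, and the function name `resize_temporal` is hypothetical.

```python
def resize_temporal(features, target_len=128):
    """Linearly interpolate a T' x C feature sequence to target_len x C.

    Assumption: linear interpolation with aligned endpoints; the report
    only states that features are resized to a fixed temporal length.
    """
    src_len = len(features)
    if src_len == 0:
        raise ValueError("features must contain at least one snippet")
    if src_len == 1:
        return [list(features[0]) for _ in range(target_len)]
    resized = []
    for t in range(target_len):
        # Map output index t onto the source time axis.
        pos = t * (src_len - 1) / (target_len - 1) if target_len > 1 else 0.0
        lo = int(pos)
        hi = min(lo + 1, src_len - 1)
        frac = pos - lo
        resized.append([(1 - frac) * a + frac * b
                        for a, b in zip(features[lo], features[hi])])
    return resized


# Example: a 300-snippet video with 256-dimensional features per snippet.
video = [[float(i)] * 256 for i in range(300)]
fixed = resize_temporal(video)
print(len(fixed), len(fixed[0]))
```

Whatever the interpolation, the point is that every video ends up with the same $128\times 256$ shape, so the proposal network can use fixed-size pyramid levels.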

2 Action Proposal Generation

In this section, we describe our Relation-Aware Pyramid Network (RapNet) for temporal proposal generation based on the above video representations. We employ an anchor-based method and design a temporal pyramid network enhanced with a self-attention module [9] to generate multi-scale proposals. First, we use K-means to cluster a set of anchors, similar to YOLO [7]. Then we generate multi-scale proposals based on these anchors with our RapNet, which is shown in Fig. 1.

Pre-defined anchors: We apply the K-means algorithm to the ground-truth annotations to select anchor boxes. In our experiments, 12 anchors achieve the best performance, which means each of the 6 generators has 2 anchor boxes.
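As a rough illustration of anchor selection, the sketch below clusters normalized ground-truth durations with plain 1D K-means. It makes several assumptions not stated in the report: anchors are represented by duration only, the distance is absolute difference (YOLO-style clustering typically uses an IoU-based distance instead), and initialization is by quantiles. The name `kmeans_1d` is hypothetical.

```python
def kmeans_1d(values, k, iters=50):
    """Cluster scalar durations into k anchor lengths (Lloyd's algorithm)."""
    values = sorted(values)
    n = len(values)
    if n < k:
        raise ValueError("need at least k values")
    # Deterministic initialization: evenly spaced quantiles of the data.
    centers = [values[int((i + 0.5) * n / k)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centers[c]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster happens to be empty.
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return sorted(centers)


# Toy durations, expressed as fractions of the video length.
durations = [0.10, 0.11, 0.09, 0.50, 0.52, 0.48, 0.90, 0.92, 0.88]
print(kmeans_1d(durations, k=3))
```

In the setting described above, `k` would be 12 and the sorted centers would be split two per generator, presumably with shorter anchors assigned to the finer pyramid levels.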

Multiscale proposal generator: To enlarge the receptive field of the temporal convolutions and capture long-range dependencies, we apply a self-attention [9] module that takes into account the relationships between snippets in the top-down process, and use a 1D FPN [6] in the bottom-up process to generate multi-scale proposals with 6 generators. This design helps features at different scales capture long-range contextual information for a better representation of the visual content.
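To make the pyramid sizes concrete, the following sketch counts temporal positions and anchor instances per generator. It assumes the temporal length halves at each level, which is suggested by the $T/2^{i}$ summation bound in the loss but not stated explicitly; the values $T=128$, 6 generators, and 2 anchors per generator come from the text. The function name `pyramid_layout` is hypothetical.

```python
def pyramid_layout(T=128, num_levels=6, anchors_per_level=2):
    """Return (level, temporal positions, anchor instances) per generator.

    Assumption: level i has T / 2**i temporal positions.
    """
    layout = []
    for i in range(num_levels):
        positions = T // (2 ** i)
        layout.append((i, positions, positions * anchors_per_level))
    return layout


for level, positions, anchors in pyramid_layout():
    print(f"generator {level}: {positions} positions, {anchors} anchor instances")
print("total anchor instances:", sum(a for _, _, a in pyramid_layout()))
```

Under these assumptions the six generators cover 128, 64, 32, 16, 8, and 4 positions, so coarse levels see long temporal spans while fine levels localize short actions.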

Label assignment: For anchor-based proposal prediction, we assign a binary label to each anchor instance, similar to YOLO [7]: the anchor with the highest Intersection-over-Union (IoU) with a ground-truth instance receives a positive label, and all others are negative. A ground-truth instance therefore matches exactly one anchor, and the boundary regression and IoU losses consider only positive samples. This label assignment leaves the positive and negative training samples heavily imbalanced, so we adopt a screening strategy that ignores some negative instances in the confidence loss. A negative instance is ignored if the highest IoU between its proposal prediction and the ground-truth instances is larger than a threshold $\theta_{iou}$.
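The assignment and screening rules can be sketched as follows. This is an illustrative reading of the paragraph above, not the authors' implementation: segments are `(start, end)` pairs, the value `theta_iou=0.5` is an assumed placeholder because the report gives no number, and the names `temporal_iou` and `assign_labels` are hypothetical.

```python
def temporal_iou(a, b):
    """IoU of two temporal segments given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def assign_labels(anchors, predictions, ground_truths, theta_iou=0.5):
    """Return one label per anchor: 1 positive, 0 negative, -1 ignored."""
    labels = [0] * len(anchors)
    # Each ground-truth instance marks its single best-matching anchor positive.
    for gt in ground_truths:
        best = max(range(len(anchors)),
                   key=lambda i: temporal_iou(anchors[i], gt))
        labels[best] = 1
    # Screening: a negative whose prediction already overlaps a ground truth
    # well is dropped from the confidence loss.
    for i, pred in enumerate(predictions):
        if labels[i] == 1:
            continue
        best_iou = max((temporal_iou(pred, gt) for gt in ground_truths),
                       default=0.0)
        if best_iou > theta_iou:
            labels[i] = -1
    return labels


anchors = [(0, 10), (5, 15), (20, 30)]
predictions = [(0, 10), (2, 12), (21, 29)]
print(assign_labels(anchors, predictions, [(1, 11)]))
```

Ignoring well-overlapping negatives avoids penalizing predictions that are nearly correct but lost the one-to-one match, which is the same motivation as the ignore threshold in YOLO-style detectors.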

Loss function: We combine three kinds of loss: an objectness (confidence) loss, regression losses, and an IoU loss, which share the same definitions as in previous works. We optimize the following loss function $L_{prop}$:

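$$
\begin{aligned}
L_{prop} = {} & \lambda_{conf}\Big[\frac{1}{N_{pos}}\sum_{i=0}^{N-1}\sum_{j=0}^{T/2^{i}}\sum_{k=0}^{M}\Delta^{ins}_{ijk}\, f_{conf}(\hat{p}^{jk}_{conf}, p^{jk}_{conf}) \\
& \quad + \frac{1}{N_{neg}}\sum_{i=0}^{N-1}\sum_{j=0}^{T/2^{i}}\sum_{k=0}^{M}\Lambda^{ins}_{ijk}\, f_{conf}(\hat{p}^{jk}_{conf}, p^{jk}_{conf})\Big] \\
& + \lambda_{c}\frac{1}{N_{pos}}\sum_{i=0}^{N-1}\sum_{j=0}^{T/2^{i}}\sum_{k=0}^{M}\Delta^{ins}_{ijk}\, f_{c}(\hat{p}^{jk}_{c}, p^{jk}_{c}) \\
& + \lambda_{w}\frac{1}{N_{pos}}\sum_{i=0}^{N-1}\sum_{j=0}^{T/2^{i}}\sum_{k=0}^{M}\Delta^{ins}_{ijk}\, f_{w}(\hat{p}^{jk}_{w}, p^{jk}_{w}) \\
& + \lambda_{iou}\frac{1}{N_{pos}}\sum_{i=0}^{N-1}\sum_{j=0}^{T/2^{i}}\sum_{k=0}^{M}\Delta^{ins}_{ijk}\,(1 - iou_{jk})
\end{aligned}
$$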

where $N_{pos}$ and $N_{neg}$ are the numbers of positive ($\Delta^{ins}_{ijk}$) and screened negative ($\Lambda^{ins}_{ijk}$) training instances, respectively, and $p^{jk}_{conf}$ is the binary label from label assignment. $f_{conf}(\cdot)$ and $f_{c}(\cdot)$ are binary cross-entropy with logits loss functions, $f_{w}(\cdot)$ is the smooth-$L_{1}$ loss, and $iou_{jk}$ is the intersection-over-union between a prediction and its ground truth. We set $\lambda_{conf}=0.2$ and all other weights to 1 for training RapNet.
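A small numeric sketch of how the terms of $L_{prop}$ combine is given below. The dictionary keys, the reading of the $c$ and $w$ subscripts as center and width regression targets, and the smooth-$L_{1}$ transition point `beta=1.0` are assumptions for illustration; only the loss types and the weights ($\lambda_{conf}=0.2$, others 1) come from the text.

```python
import math


def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy on a raw logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def smooth_l1(pred, target, beta=1.0):
    """Smooth-L1 loss; beta=1.0 is an assumed transition point."""
    diff = abs(pred - target)
    return 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta


def proposal_loss(instances, lam_conf=0.2, lam_c=1.0, lam_w=1.0, lam_iou=1.0):
    """instances: dicts with 'label' (1 positive, 0 screened negative) and
    'conf_logit'; positives also carry 'c_logit', 'c_target', 'w_pred',
    'w_target', and 'iou'. Ignored instances are simply left out.
    """
    pos = [s for s in instances if s["label"] == 1]
    neg = [s for s in instances if s["label"] == 0]
    n_pos, n_neg = max(len(pos), 1), max(len(neg), 1)
    conf = (sum(bce_with_logits(s["conf_logit"], 1.0) for s in pos) / n_pos
            + sum(bce_with_logits(s["conf_logit"], 0.0) for s in neg) / n_neg)
    center = sum(bce_with_logits(s["c_logit"], s["c_target"]) for s in pos) / n_pos
    width = sum(smooth_l1(s["w_pred"], s["w_target"]) for s in pos) / n_pos
    iou = sum(1.0 - s["iou"] for s in pos) / n_pos
    return lam_conf * conf + lam_c * center + lam_w * width + lam_iou * iou


batch = [
    {"label": 1, "conf_logit": 0.0, "c_logit": 0.0, "c_target": 0.5,
     "w_pred": 0.5, "w_target": 0.0, "iou": 0.75},
    {"label": 0, "conf_logit": 0.0},
]
print(proposal_loss(batch))
```

Down-weighting the confidence term relative to the regression and IoU terms is consistent with the large number of negatives that remain even after screening.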

3 Boundary Adjustment

Because the start and end points of an activity are usually not clear-cut, even when the global context is considered, we design a two-stage strategy to adjust the boundaries of the proposals generated by RapNet. First, we employ the refined PEM module of BSN [5] with frame-level actionness to refine the proposal boundaries. We then suppress redundant proposals with soft-NMS [1] to obtain re-ranked results. Finally, using the frame-level actionness, we apply the watershed algorithm of TAG [10] to adjust the proposal boundaries again. The improvement from these methods is shown in Tab. 2.
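The soft-NMS re-ranking step can be sketched for temporal segments as below. This uses the Gaussian decay variant from the soft-NMS paper [1]; the values `sigma=0.5` and `score_floor=0.001` are illustrative assumptions, since the report does not list its settings, and the function names are hypothetical.

```python
import math


def temporal_iou(a, b):
    """IoU of two temporal segments given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(proposals, sigma=0.5, score_floor=0.001):
    """Gaussian soft-NMS over (start, end, score) proposals.

    Returns proposals in the order they were selected, with decayed scores.
    """
    remaining = list(proposals)
    kept = []
    while remaining:
        # Select the current highest-scoring proposal.
        best_idx = max(range(len(remaining)), key=lambda i: remaining[i][2])
        best = remaining.pop(best_idx)
        kept.append(best)
        decayed = []
        for start, end, score in remaining:
            iou = temporal_iou((start, end), (best[0], best[1]))
            # Overlapping proposals are down-weighted rather than removed.
            score *= math.exp(-(iou * iou) / sigma)
            if score > score_floor:
                decayed.append((start, end, score))
        remaining = decayed
    return kept


ranked = soft_nms([(0, 10, 0.9), (1, 11, 0.8), (20, 30, 0.7)])
for start, end, score in ranked:
    print(start, end, round(score, 4))
```

In this toy example the near-duplicate `(1, 11)` drops below the non-overlapping `(20, 30)`, which is the re-ordering effect that matters for AR@AN: distinct proposals move up the ranking.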

Our single model APG achieves 69.61% AUC. The PEM module improves the result by 0.74 points to 70.35%, and TAG refinement further increases the AUC to 70.65%, which indicates that our boundary adjustment is effective for temporal action proposal.

4 Ensemble

To further improve the performance of our solution, we ensemble models that differ in the following ways:

• Video representations: we use ResNet-50 C3D and ResNet-101 C3D to encode the visual content of the video.

• Anchor boxes: we use different numbers of anchor boxes, namely 12 and 18, to generate different proposals.

After ensembling, we achieve 71.51% on the validation set and 71.38% on the testing server.

5 Conclusion

In this work, we propose a novel action proposal generation network enhanced with a self-attention module and FPN, called Relation-Aware Pyramid Network (RapNet), for the temporal action proposal task. We also introduce a two-stage scheme that refines proposal boundaries to improve performance.

Bibliography

[1] N. Bodla, B. Singh, R. Chellappa, and L. S. Davis. Soft-NMS – improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision, pages 5561–5569, 2017.
[2] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
[3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. 2009.
[4] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[5] T. Lin, X. Zhao, H. Su, C. Wang, and M. Yang. BSN: Boundary sensitive network for temporal action proposal generation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2018.
[6] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.
[7] J. Redmon and A. Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[8] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4489–4497, 2015.