Relation-Aware Pyramid Network (RapNet) for temporal action proposal
Jialin Gao, Zhixiang Shi, Jiani Li, Yufeng Yuan, Jiwei Li, Xi Zhou

TL;DR
This paper introduces RapNet, a relation-aware pyramid network for generating temporal action proposals in videos, utilizing a fine-tuned CNN, multi-scale proposals, boundary adjustment, and ensemble methods, achieving second place in a challenge.
Contribution
The paper presents a novel relation-aware pyramid network architecture for improved temporal action proposal generation in videos.
Findings
Achieved 2nd place in ActivityNet Challenge 2019.
Effective multi-scale proposal generation with boundary adjustment.
Enhanced performance through ensemble methods.
Abstract
In this technical report, we describe our solution to temporal action proposal (task 1) in ActivityNet Challenge 2019. First, we fine-tune a ResNet-50-C3D CNN on ActivityNet v1.3, starting from a Kinetics-pretrained model, to extract snippet-level video representations, and then we design a Relation-Aware Pyramid Network (RapNet) to generate temporal multi-scale proposals with confidence scores. After that, we employ a two-stage snippet-level boundary adjustment scheme to re-rank the generated proposals. Ensemble methods are also used to improve the performance of our solution, which helps us achieve 2nd place.
| Layer name | Net Architecture | Output size |
|---|---|---|
| | , 64, s(1, 2, 2) | |
| | , s(1, 2, 2) | |
| | , s(2, 1, 1) | |
| | global average pool | |
| Method | AR@1 | AR@5 | AR@10 | AR@100 | AUC(val) |
|---|---|---|---|---|---|
| APG | 34.70% | 49.65% | 57.23% | 78.21% | 69.61% |
| +PEM | 35.21% | 50.29% | 58.17% | 78.74% | 70.35% |
| +TAG | 35.30% | 50.44% | 58.31% | 79.15% | 70.65% |
Relation-Aware Pyramid Network (RapNet) for temporal action proposal: Submission to ActivityNet Challenge 2019
Jialin Gao
Cooperative Medianet Innovation Center
Shanghai Jiao Tong University
Zhixiang Shi, Jiani Li, Yufeng Yuan, Jiwei Li, Xi Zhou
CloudWalk Technology Co., Ltd
{shizhixiang, lijiani, yuanyufeng, lijiwei, zhouxi}@cloudwalk.cn
1 Video Representation
For the temporal action proposal task in ActivityNet Challenge 2019, we employ a two-stage strategy, which first encodes the visual content of a video and then generates proposals. In this section, we describe how we obtain the video representations used for proposal generation.
First, we extract RGB frames from the original video and then adopt a Residual C3D network [4, 8] to encode the visual context. To obtain compact features, we compose a snippet sequence from a given video, characterized by its rescaled temporal length and the number of frames in each snippet. Our Residual C3D network takes these snippets as input and generates the encoded representations of the video.
For details, the backbone C3D network is shown in Tab.1; we use both ResNet-50 and ResNet-101 for this task. The network is first trained on Kinetics [2], initialized with the ImageNet [3] pretrained weights provided in the PyTorch Model Zoo, reaching 71.48% top-1 accuracy, and is then fine-tuned on the ActivityNet-v1.3 dataset. Finally, we add another fully connected layer that generates a 256-dimensional vector for each snippet, so that a video can be represented as a feature map. Because videos differ in temporal length, we resize these feature maps to a fixed temporal length, so that every video representation has the same size for proposal generation.
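The resizing step can be illustrated with a minimal sketch. The report does not say how the feature maps are resized, so linear interpolation along the time axis is an assumption here, and the function name `resize_temporal` and the target length of 100 are hypothetical choices for illustration only.

```python
import numpy as np

def resize_temporal(features: np.ndarray, target_len: int) -> np.ndarray:
    """Linearly interpolate a (T, C) snippet feature sequence to (target_len, C).

    Linear interpolation is an assumed choice, not the report's stated method.
    """
    src_len, channels = features.shape
    if src_len == 1:
        # A single snippet can only be repeated.
        return np.repeat(features, target_len, axis=0)
    # Positions of the output steps expressed in input-snippet coordinates.
    positions = np.linspace(0.0, src_len - 1, num=target_len)
    src_index = np.arange(src_len)
    # np.interp works on 1-D data, so each channel is interpolated separately.
    columns = [np.interp(positions, src_index, features[:, c]) for c in range(channels)]
    return np.stack(columns, axis=1)

# Hypothetical video with 37 snippets and 256-dimensional snippet features.
video = np.random.default_rng(0).normal(size=(37, 256))
fixed = resize_temporal(video, target_len=100)
print(fixed.shape)  # (100, 256)
```

With this kind of resizing, videos of any length map to the same input size, which is what lets a single proposal network process all of them.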
2 Action Proposal Generation
In this section, we describe our Relation-Aware Pyramid Network (RapNet) for temporal proposal generation based on the above video representations. We employ an anchor-based method and design a temporal pyramid network enhanced with a self-attention module [9] to generate multi-scale proposals. First, we use K-means to cluster a certain number of anchors, similar to YOLO [7]. Then we generate multi-scale proposals based on these anchors via our RapNet, which is shown in Fig.1.
Pre-defined anchors: Based on the annotations, we employ the K-means algorithm to select anchor boxes. In our experiments, 12 anchors achieve the best performance, which means that each generator has 2 anchor boxes.
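As a rough sketch of anchor selection, the snippet below clusters annotated action durations with K-means under a 1 - IoU distance, in the spirit of YOLO-style anchor clustering. The report does not specify the distance or the initialization, so both are assumptions, and `kmeans_anchors` is a hypothetical helper name.

```python
import numpy as np

def kmeans_anchors(durations, k: int, iters: int = 100) -> np.ndarray:
    """Cluster positive 1-D durations into k anchor lengths using a 1 - IoU distance."""
    durations = np.asarray(durations, dtype=float)
    # Deterministic initialization (an assumption): spread centers over the quantiles.
    centers = np.quantile(durations, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        # IoU of two segments that share a start point is min(length) / max(length).
        iou = np.minimum(durations[:, None], centers[None, :]) / np.maximum(
            durations[:, None], centers[None, :]
        )
        assign = np.argmax(iou, axis=1)  # smallest 1 - IoU is the largest IoU
        new_centers = centers.copy()
        for j in range(k):
            members = durations[assign == j]
            if members.size:  # keep the old center if a cluster becomes empty
                new_centers[j] = members.mean()
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return np.sort(centers)

# Hypothetical normalized action durations taken from annotations.
anchors = kmeans_anchors([0.09, 0.11, 0.49, 0.51, 0.89, 0.91], k=3)
print(anchors)
```

In the setting described above, `k` would be 12 (or 18 for the second ensemble member), and the resulting anchor lengths would be distributed over the generators.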
Multiscale proposal generator: In order to enlarge the receptive field of temporal convolution and capture long-range dependencies, we apply a self-attention [9] module to take into account the relationships between snippets in the top-down process and use a 1D FPN [6] in the bottom-up process to generate multi-scale proposals with 6 generators. This design helps the features at different scales capture long-range contextual information for a better representation of the visual content.
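To make the role of self-attention concrete, here is a minimal sketch of scaled dot-product self-attention over a snippet feature map with a residual connection. The report does not spell out the form of its self-attention module, so this particular formulation, the projection matrices `w_q`, `w_k`, `w_v`, and the feature sizes are all assumptions for illustration.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over snippets.

    x: (T, C) snippet features; w_q, w_k: (C, D); w_v: (C, C).
    Returns the residual output (T, C) and the (T, T) attention weights.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[1])               # pairwise snippet affinities
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # softmax over snippets
    # Every snippet aggregates information from all others, however far apart in time.
    return x + weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 256))            # hypothetical 100 snippets, 256-d features
w_q = rng.normal(size=(256, 64)) * 0.05
w_k = rng.normal(size=(256, 64)) * 0.05
w_v = rng.normal(size=(256, 256)) * 0.05
out, attn = self_attention(x, w_q, w_k, w_v)
print(out.shape, attn.shape)  # (100, 256) (100, 100)
```

Because each output row mixes all input rows, the effective receptive field covers the whole sequence, which is the property the paragraph above relies on.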
Label assignment: For anchor-based proposal prediction, we assign a binary label to each anchor instance, similar to YOLO [7]: a positive label is assigned to the anchor with the highest Intersection-over-Union (IoU) with the corresponding ground-truth instance, and a negative label otherwise. In this way, a ground-truth instance can match only one anchor, and the boundary regression and IoU losses consider only positive samples. This label assignment makes the ratio between positive and negative training samples imbalanced. We therefore adopt a screening strategy that ignores some negative instances in the confidence loss: a negative instance is ignored if the highest IoU between its proposal prediction and any ground-truth instance is larger than a threshold.
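The assignment and screening rules can be sketched as follows. This is an illustrative reading of the paragraph above, not the authors' code: the helper names are hypothetical, segments are assumed to be `(start, end)` pairs with positive length, and the threshold value of 0.5 is a placeholder because the report's value is not available here.

```python
import numpy as np

def temporal_iou(segment, others):
    """IoU between one (start, end) segment and an (N, 2) array of segments."""
    others = np.asarray(others, dtype=float)
    inter = np.minimum(segment[1], others[:, 1]) - np.maximum(segment[0], others[:, 0])
    inter = np.clip(inter, 0.0, None)
    union = (segment[1] - segment[0]) + (others[:, 1] - others[:, 0]) - inter
    return inter / union

def assign_labels(anchors, predictions, ground_truths, ignore_thresh=0.5):
    """Return (labels, ignored) for anchors and their corresponding predictions.

    labels[i] is 1 for the single best-matching anchor of each ground truth.
    ignored[i] marks negatives whose prediction overlaps some ground truth
    by more than ignore_thresh (an assumed value).
    """
    anchors = np.asarray(anchors, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    labels = np.zeros(len(anchors), dtype=int)
    best_pred_iou = np.zeros(len(anchors))
    for gt in ground_truths:
        labels[np.argmax(temporal_iou(gt, anchors))] = 1
        best_pred_iou = np.maximum(best_pred_iou, temporal_iou(gt, predictions))
    ignored = (labels == 0) & (best_pred_iou > ignore_thresh)
    return labels, ignored

anchors = [[0, 10], [0, 20], [20, 40]]
predictions = [[0, 11], [1, 19], [50, 60]]   # one prediction per anchor
labels, ignored = assign_labels(anchors, predictions, ground_truths=[[0, 18]])
print(labels, ignored)  # [0 1 0] [ True False False]
```

In this toy case the second anchor is the only positive, and the first anchor is dropped from the confidence loss because its prediction already overlaps the ground truth well.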
Loss function: We use three loss terms, an objectness loss, a regression loss, and an IoU loss, which share the same definitions as in previous works, and we optimize their weighted sum. The loss is computed over the positive and the screened negative training instances, using the binary labels from the label assignment. The objectness and IoU losses are binary cross-entropy with logits losses, the regression loss is a smooth-L1 loss, and the IoU is the Intersection-over-Union between a prediction and its ground truth. All loss weights but one are set to 1 for training RapNet.
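For orientation, one plausible way to write such an objective is sketched below. This is a reconstruction under stated assumptions, not the report's equation: every symbol here is introduced for illustration ($N_{pos}$ and $N_{neg}$ for the numbers of positive and screened negative instances, $y_i$ for the binary label, $p_i$ for the objectness logit, $\hat{b}_i$ and $b_i$ for predicted and ground-truth boundaries, $q_i$ for the predicted IoU logit, and $\lambda$ for the loss weights), and the normalization of each term is assumed.

```latex
% Assumed form; symbols and normalization are illustrative, not taken from the report.
L = \frac{\lambda_{obj}}{N_{pos} + N_{neg}} \sum_{i} L_{obj}(p_i, y_i)
  + \frac{\lambda_{reg}}{N_{pos}} \sum_{i} y_i \, L_{reg}(\hat{b}_i, b_i)
  + \frac{\lambda_{iou}}{N_{pos}} \sum_{i} y_i \, L_{iou}(q_i, \mathrm{IoU}_i)
```

Under this reading, the first sum runs over positive and screened negative instances, $L_{obj}$ and $L_{iou}$ are binary cross-entropy with logits losses, $L_{reg}$ is a smooth-$L_1$ loss, and the factor $y_i$ restricts the regression and IoU terms to positive samples.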
3 Boundary Adjustment
Because the start and end points of an activity are usually not very clear even when the global context is considered, we design a two-stage strategy to adjust the boundaries of the proposals generated by our RapNet above. First, we employ the refined PEM module of BSN [5] with frame-level actionness to refine the proposal boundaries, and then suppress redundant proposals with soft-NMS [1] to obtain re-ranked results. Finally, using the frame-level actionness, we apply the watershed algorithm of TAG [10] to adjust the proposals' boundaries again. The improvement from these methods is shown in Tab.2.
Our single model APG achieves 69.61% AUC. The PEM module improves this result by 0.74 percentage points, and TAG refinement further increases the AUC to 70.65%, which indicates that our boundary adjustment is effective for temporal action proposal.
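The suppression step can be illustrated with a minimal sketch of soft-NMS applied to 1-D temporal proposals. The report cites soft-NMS [1] without stating which decay function or parameters it uses, so the Gaussian decay, `sigma`, and `score_thresh` below are assumptions, and `soft_nms` is a hypothetical helper name.

```python
import numpy as np

def soft_nms(segments, scores, sigma=0.5, score_thresh=1e-3):
    """Gaussian soft-NMS for (start, end) proposals.

    Returns an array of (start, end, score) rows in re-ranked order.
    """
    segments = np.asarray(segments, dtype=float).copy()
    scores = np.asarray(scores, dtype=float).copy()
    kept = []
    while scores.size and scores.max() > score_thresh:
        top = int(np.argmax(scores))
        best, best_score = segments[top], scores[top]
        kept.append((best[0], best[1], best_score))
        segments = np.delete(segments, top, axis=0)
        scores = np.delete(scores, top)
        if scores.size == 0:
            break
        inter = np.minimum(best[1], segments[:, 1]) - np.maximum(best[0], segments[:, 0])
        inter = np.clip(inter, 0.0, None)
        union = (best[1] - best[0]) + (segments[:, 1] - segments[:, 0]) - inter
        iou = inter / union
        # Decay, rather than discard, proposals that overlap the selected one.
        scores = scores * np.exp(-(iou ** 2) / sigma)
    return np.array(kept)

ranked = soft_nms([[0, 10], [1, 11], [20, 30]], [0.9, 0.8, 0.7])
print(ranked[:, :2])  # order of segments after re-ranking
```

In this toy case the proposal `[1, 11]` overlaps heavily with the top-ranked `[0, 10]`, so its score is decayed and it drops below the non-overlapping `[20, 30]` in the final order.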
4 Ensemble
In order to enhance the performance of our solution, we introduce several improvements as follows:
- Video representations: we use ResNet-50 C3D and ResNet-101 C3D to encode the visual content of a video.
- Anchor boxes: we use different numbers of anchor boxes, namely 12 and 18, to generate different proposals.
After ensembling, we achieve 71.51% AUC on the validation set and 71.38% on the testing server.
5 Conclusion
In this work, we propose a novel action proposal generation network enhanced with a self-attention module and FPN, called Relation-Aware Pyramid Network (RapNet), for the temporal action proposal task. We also introduce a two-stage scheme that refines the boundaries of proposals to improve performance.
References
- [1] N. Bodla, B. Singh, R. Chellappa, and L. S. Davis. Soft-NMS – improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision, pages 5561–5569, 2017.
- [2] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
- [3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. 2009.
- [4] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- [5] T. Lin, X. Zhao, H. Su, C. Wang, and M. Yang. BSN: Boundary sensitive network for temporal action proposal generation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2018.
- [6] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.
- [7] J. Redmon and A. Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
- [8] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4489–4497, 2015.
