Multi-Granularity Fusion Network for Proposal and Activity Localization: Submission to ActivityNet Challenge 2019 Task 1 and Task 2
Haisheng Su, Xu Zhao, Shuming Liu

TL;DR
This paper introduces a Multi-Granularity Fusion Network that combines proposals from various frameworks to improve temporal action proposal generation and localization, achieving state-of-the-art results in ActivityNet Challenge 2019.
Contribution
The novel MGFN effectively integrates diverse proposal sources considering multiple perspectives, enhancing proposal quality and localization accuracy.
Findings
Achieved 69.85 AUC score in proposal generation
Attained 38.90 mAP in action localization
Outperformed previous methods on ActivityNet Challenge 2019
Abstract
This technical report presents an overview of our solution used in the submission to ActivityNet Challenge 2019 Task 1 (temporal action proposal generation) and Task 2 (temporal action localization/detection). Temporal action proposal indicates the temporal intervals containing the actions and plays an important role in temporal action localization. Top-down and bottom-up methods are the two main categories used for proposal generation in the existing literature. In this paper, we devise a novel Multi-Granularity Fusion Network (MGFN) to combine the proposals generated from different frameworks for complementary filtering and confidence re-ranking. Specifically, we consider the diversity comprehensively from multiple perspectives, e.g. the characteristic aspect, the data aspect, the model aspect and the result aspect. Our MGFN achieves the state-of-the-art performance…
| Setting | AUC (val) |
|---|---|
| APN | 62.45 |
| TAG | 63.97 |
| improved-BSN | 68.18 |
| improved-BSN+APN | 68.58 |
| CAR | 68.01 |
| improved-BSN+APN+CAR | 69.85 |
Multi-Granularity Fusion Network for Proposal and Activity Localization: Submission to ActivityNet Challenge 2019 Task 1 and Task 2
Haisheng Su, Shuming Liu, Xu Zhao
Department of Automation,
Shanghai Jiao Tong University
{suhaisheng,shumingliu,zhaoxu}@sjtu.edu.cn Corresponding author.
Abstract
This technical report presents an overview of our solution used in the submission to ActivityNet Challenge 2019 Task 1 (temporal action proposal generation) and Task 2 (temporal action localization/detection). Temporal action proposal indicates the temporal intervals containing the actions and plays an important role in temporal action localization. Top-down and bottom-up methods are the two main categories used for proposal generation in the existing literature. In this paper, we devise a novel Multi-Granularity Fusion Network (MGFN) to combine the proposals generated from different frameworks for complementary filtering and confidence re-ranking. Specifically, we consider the diversity comprehensively from multiple perspectives, e.g. the characteristic aspect, the data aspect, the model aspect and the result aspect. Our MGFN achieves the state-of-the-art performance on the temporal action proposal task with 69.85 AUC score and the temporal action localization task with 38.90 mAP on the challenge testing set.
1 Task Introduction
The temporal action detection task has received much attention in recent years. It requires not only categorizing real-world untrimmed videos but also locating the temporal boundaries of action instances. Analogous to object proposals for object detection in images, a temporal action proposal indicates a temporal interval containing an action and plays an important role in temporal action detection. It is commonly recognized that high-quality proposals should have precise temporal boundaries and reliable confidence scores. Existing proposal generation methods that aim to meet these two conditions fall into two main categories. The first category [2, 3, 4, 5, 8] works in a top-down fashion; however, proposals generated this way inevitably have imprecise boundaries, even with regression. The second category [7, 11], which has drawn much attention in the community recently, tackles this problem in a bottom-up fashion, where the input video is evaluated at a finer level. A typical example is the Boundary Sensitive Network (BSN) [7], which generates proposals with flexible durations and reliable confidence scores. Although BSN achieves convincing performance in this manner, it still suffers from several drawbacks. For example, the snippet-level probability sequences of actionness and boundaries are sensitive to noise, and the confidence scores used for proposal retrieval are of inferior quality.
2 Approach Overview
In this section, we will introduce the technical details of our approach.
2.1 Video Features Encoding
We first adopt the two-stream network [9] to encode the visual features of an input video, where the RGB stream takes an RGB image as input to capture spatial features, while the flow stream operates on stacked optical flows to capture motion information. This kind of architecture has been widely used in action recognition [10] and temporal action detection tasks. As for the characteristic aspect, we evaluate many different ConvNet architectures pre-trained on the Kinetics-400 dataset, such as ResNet-50, ResNet-101, ResNet-152, ResNet-200, I3D, P3D, Inception-V3 and Inception-ResNet-V2, and use the effective ones for feature extraction. Finally, we employ a set of effective feature representations for proposal generation, thereby ensuring feature diversity.
2.2 Data Augmentation
Considering the ground-truth distribution and in order to reduce computational cost, we rescale each feature sequence to a fixed length by linear interpolation before feeding it into our MGFN. For data augmentation, and to handle the varied lengths of action instances, we adopt a set of feature sequence lengths during the training phase, such as 64, 100, 128 and 192. Meanwhile, we randomly sample 2000 validation videos and add them to the training set, leaving the others for validation.
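The rescaling step described above can be illustrated with a minimal sketch. The function name `rescale_features` and the endpoint-aligned sampling scheme are assumptions made for illustration; the report does not specify how the interpolation grid is placed.

```python
from typing import List, Sequence


def rescale_features(features: Sequence[Sequence[float]],
                     target_len: int) -> List[List[float]]:
    """Linearly interpolate a (T, D) feature sequence to (target_len, D).

    Hypothetical helper: sample positions are spread so that the first and
    last output steps coincide with the first and last input steps.
    """
    src_len = len(features)
    if src_len == 0 or target_len <= 0:
        raise ValueError("need a non-empty sequence and a positive target length")
    if src_len == 1:
        # A single snippet can only be repeated.
        return [list(features[0]) for _ in range(target_len)]
    out = []
    for i in range(target_len):
        # Position of output step i on the source time axis.
        pos = i * (src_len - 1) / (target_len - 1) if target_len > 1 else 0.0
        lo = int(pos)
        hi = min(lo + 1, src_len - 1)
        w = pos - lo
        # Blend the two neighbouring source snippets, channel by channel.
        out.append([(1 - w) * a + w * b
                    for a, b in zip(features[lo], features[hi])])
    return out
```

Calling such a helper with each of the training lengths (64, 100, 128, 192) on the same video would yield the multi-scale inputs mentioned above.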
2.3 Multi-Granularity Fusion Network
APN. Prop-SSAD [6] is a simplified version of SSAD [5] and is the first to apply an anchor mechanism to the temporal action proposal generation task. It utilizes several temporal convolution anchor layers with different resolutions to generate proposals with varied lengths. The lower anchor layers are used to locate short-range action proposals, while the higher anchor layers are responsible for covering long-range action proposals. Through this mechanism, the generated proposals can be densely distributed on each feature map. In this paper, we improve the ranking performance with three types of classifiers, namely a binary activity classifier, a completeness classifier and an Intersection-over-Union (IoU) classifier/regressor.
In selecting positive samples for the activity classifier during training, we use proposals whose IoU with a ground-truth instance is larger than 0.7, as well as proposals whose IoU is lower than 0.3 but whose intersection with a ground truth over their own time span (IoP) is above 0.8. Proposals are regarded as negative samples only when less than 5% of their time span overlaps with any ground-truth action instance.
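The sampling rule for the activity classifier can be sketched as follows. This is only an illustration: the function names are hypothetical, segments are assumed to be `(start, end)` pairs, and the 5% rule is interpreted here as "the IoP with every ground truth is below 0.05", which the report does not state explicitly.

```python
from typing import Optional, Sequence, Tuple

Segment = Tuple[float, float]  # (start, end), assumed to be in seconds


def intersection(a: Segment, b: Segment) -> float:
    """Length of the temporal overlap between two segments."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def tiou(p: Segment, g: Segment) -> float:
    """Temporal Intersection-over-Union."""
    inter = intersection(p, g)
    union = (p[1] - p[0]) + (g[1] - g[0]) - inter
    return inter / union if union > 0 else 0.0


def iop(p: Segment, g: Segment) -> float:
    """Intersection over the proposal's own time span."""
    length = p[1] - p[0]
    return intersection(p, g) / length if length > 0 else 0.0


def activity_label(proposal: Segment,
                   gts: Sequence[Segment]) -> Optional[int]:
    """Return 1 (positive), 0 (negative) or None (not used for training)."""
    for g in gts:
        iou = tiou(proposal, g)
        if iou > 0.7 or (iou < 0.3 and iop(proposal, g) > 0.8):
            return 1
    # Assumed reading of the 5% rule: almost no overlap with any ground truth.
    if max((iop(proposal, g) for g in gts), default=0.0) < 0.05:
        return 0
    return None
```

Proposals that fall between the two rules, for example one with IoU 0.33 and IoP 0.5, receive no label under this reading and would be skipped when training the activity classifier.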
For the completeness classifier, proposals with IoU larger than 0.7 are employed as positive samples, and proposals with … while … are used as negative samples. For the IoU classifier, we divide the proposals into three categories according to their IoU values: the value range 0-1 is discretized into the three ranges 0-0.3, 0.3-0.7 and 0.7-1.0, referred to as the background value range, the middle value range and the high value range, respectively. With these three classifiers, our Anchor Pyramid Network (APN) can evaluate proposals comprehensively in complicated situations. We use 7 anchor layers, with 512 feature maps, to predict the proposals. During the inference stage, we fuse the outputs of the three classifiers to obtain the confidence score for each proposal:
[TABLE]
where the three terms indicate the actionness score, the completeness score and the IoU score, respectively. However, proposals generated in this way inevitably have imprecise boundaries, even with regression.
TAG. TAG [11] first evaluates the snippet-level actionness, which indicates whether a snippet lies inside an action instance, and then adopts the watershed algorithm to group consecutive snippets with two sets of thresholds. Proposals generated in this bottom-up fashion are more sensitive to temporal boundaries than those of anchor-based methods. However, proposals generated by TAG cannot be further retrieved, because they lack confidence scores evaluated from a global view.
Improved-BSN. BSN [7] also generates proposals in a bottom-up fashion. It first evaluates the probability of each temporal location being in the starting, ending and middle regions. Then, by combining the high-probability boundary locations, it generates abundant proposals with flexible durations and confidence scores. However, the probability sequences predicted by a simple three-layer temporal convolution network are sensitive to noise, causing many false alarms and low precision. Besides, the quality of the confidence scores used for proposal retrieval is also limited, owing to the inferior proposal-level representations. In this paper, we further improve the performance of BSN by improving the quality of the probability sequences with several edge-smoothing strategies and by improving the proposal-level representations used for ranking. Besides, we unify the training process for more robust optimization.
Complementary Filtering, Temporal Boundary Adjustment and Proposal Ranking Model (CAR). As discussed above, proposals generated by the actionness score grouping method and by the anchor/sliding-window based method are complementary to each other. Specifically, proposals generated by the anchor-based method uniformly cover the whole video but have imprecise boundaries, while the grouping-based method generates proposals with more precise boundaries but relies heavily on the actionness score and may miss potential proposals when that score is low. We therefore first train a binary classifier that takes ground-truth action instances as input and uses the proposal results of TAG as the label set. When selecting positive samples, a ground-truth instance is labeled 1 if it overlaps with a TAG proposal with an IoU larger than 0.5, and 0 otherwise. During the testing phase, we feed the proposal results of APN to the binary classifier in order to select the proposals with low scores (below 0.5). The selected APN proposals are then combined with the TAG proposals for boundary adjustment and proposal ranking, with three-stage (left, central, right) unit features as input to the multi-layer perceptron models, respectively.
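The labeling and filtering logic of the complementary filtering step might look like the sketch below. The helper names are hypothetical, the classifier itself is not modeled (its scores are simply passed in as a list), and "low scores" is assumed to mean scores strictly below 0.5.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[float, float]  # (start, end)


def tiou(a: Segment, b: Segment) -> float:
    """Temporal Intersection-over-Union of two segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def label_ground_truths(gts: Sequence[Segment],
                        tag_proposals: Sequence[Segment],
                        iou_thr: float = 0.5) -> List[int]:
    """Training labels: 1 if some TAG proposal covers the ground truth well."""
    return [1 if any(tiou(g, p) > iou_thr for p in tag_proposals) else 0
            for g in gts]


def select_complementary(apn_proposals: Sequence[Segment],
                         scores: Sequence[float],
                         score_thr: float = 0.5) -> List[Segment]:
    """Keep APN proposals that the classifier predicts TAG would miss.

    `scores` are assumed to be the binary classifier's outputs, one per
    APN proposal; a low score means "unlikely to be found by TAG".
    """
    return [p for p, s in zip(apn_proposals, scores) if s < score_thr]
```

Under these assumptions, the segments returned by `select_complementary` would be the APN proposals merged with the TAG proposals before boundary adjustment and ranking.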
Proposal Re-ranking. Since both the proposal generation and quality evaluation can influence the evaluation of proposals, we re-rank the proposals of improved BSN as:
[TABLE]
where the symbols indicate, respectively, the starting probability, the ending probability and the IoU score of the proposal predicted by the improved-BSN, together with the confidence of the APN proposal that best matches it.
3 Experiment Results
3.1 Evaluation Metrics
For the temporal action proposal generation task, Average Recall (AR) calculated under different tIoU thresholds is commonly adopted as an evaluation metric. In this challenge, the thresholds are set from 0.5 to 0.95 with a step size of 0.05, and the area under the Average Recall vs. Average Number of Proposals (AN) curve (AUC) is used as the final evaluation metric, where AN ranges from 0 to 100.
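To make the metric concrete, here is a simplified single-video sketch of AR and of the area under the AR-AN curve. It is not the official evaluation code: the function names are hypothetical, proposals are assumed to be already sorted by confidence, and AN is sampled at the integers 1 to 100 with trapezoidal integration.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[float, float]  # (start, end)


def tiou(a: Segment, b: Segment) -> float:
    """Temporal Intersection-over-Union of two segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def average_recall(proposals: Sequence[Segment],
                   gts: Sequence[Segment],
                   thresholds: Sequence[float]) -> float:
    """Recall of the ground truths, averaged over the tIoU thresholds."""
    if not gts:
        return 0.0
    recalls = []
    for t in thresholds:
        hits = sum(1 for g in gts if any(tiou(p, g) >= t for p in proposals))
        recalls.append(hits / len(gts))
    return sum(recalls) / len(recalls)


def ar_an_auc(ranked_proposals: Sequence[Segment],
              gts: Sequence[Segment],
              max_an: int = 100) -> float:
    """Area under the AR-vs-AN curve, scaled to the range 0-100."""
    # tIoU thresholds 0.50, 0.55, ..., 0.95.
    thresholds: List[float] = [(50 + 5 * i) / 100 for i in range(10)]
    ar = [average_recall(ranked_proposals[:n], gts, thresholds)
          for n in range(1, max_an + 1)]
    # Trapezoidal rule over unit steps of AN.
    area = sum((ar[i] + ar[i + 1]) / 2 for i in range(len(ar) - 1))
    return 100.0 * area / (max_an - 1)
```

A perfect top-ranked proposal gives an AUC of 100 in this sketch, while a proposal with tIoU 0.81 is counted at seven of the ten thresholds, giving an AR of 0.7.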
For the temporal action localization task, mean Average Precision (mAP) is the conventional evaluation metric, where Average Precision (AP) is calculated for each category separately. In this challenge, the average mAP over tIoU thresholds from 0.5 to 0.95 with a step size of 0.05 is reported.
3.2 Temporal Action Proposal and Localization
The performance of APN, TAG, improved-BSN and CAR on the validation set of ActivityNet-1.3 is shown in Table 1. For the model fusion process, we consider inter-model and intra-model fusion respectively. We obtain the results of each model with multi-modality fusion of different feature representations and testing scales. For multi-scale testing, we concatenate not only the outputs of different fixed scales but also those of the free scales of videos. Fig. 1 illustrates the difference in the ground-truth AUC distribution between fix-scale and free-scale testing on the validation set of ActivityNet-1.3. We observe that the ground-truth AUC contribution of free-scale testing is superior to that of fix-scale testing when the duration of the ground truth is relatively short. Furthermore, we perform score fusion between the improved-BSN and APN to re-rank the proposals.
For Task 1, by merging the improved-BSN & APN model and the CAR model with Soft-NMS [1], we achieve an AUC score of 69.85 on both the validation set and the testing server.
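As an illustration of the merging step, the following sketch applies the Gaussian variant of Soft-NMS [1] to temporal proposals pooled from several models. The report does not state which Soft-NMS variant or hyperparameters were used, so `sigma`, `score_thr` and `top_k` are assumed values, and the function name is hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Proposal = Tuple[float, float, float]  # (start, end, score)


def tiou(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Temporal Intersection-over-Union of two segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(proposals: Sequence[Proposal],
             sigma: float = 0.5,
             score_thr: float = 0.001,
             top_k: int = 100) -> List[Proposal]:
    """Gaussian Soft-NMS: decay the scores of overlapping proposals.

    Instead of discarding a proposal that overlaps a higher-scoring one,
    its score is multiplied by exp(-tIoU^2 / sigma).
    """
    remaining = [list(p) for p in proposals]
    kept: List[Proposal] = []
    while remaining and len(kept) < top_k:
        # Pick the proposal with the highest current score.
        best = max(range(len(remaining)), key=lambda i: remaining[i][2])
        start, end, score = remaining.pop(best)
        if score < score_thr:
            break
        kept.append((start, end, score))
        # Down-weight everything that overlaps the selected proposal.
        for r in remaining:
            overlap = tiou((start, end), (r[0], r[1]))
            r[2] *= math.exp(-(overlap ** 2) / sigma)
    return kept
```

To merge two models under these assumptions, their proposal lists would simply be concatenated before calling `soft_nms`, so that near-duplicate proposals from different models are pushed down the ranking rather than removed.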
For Task 2, our improved-BSN & APN achieves 38.90 mAP on the testing server and wins third place in the temporal action localization task of ActivityNet Challenge 2019.
4 Conclusion
In this challenge technical notebook, we comprehensively analyze the complementary characteristics of bottom-up and top-down proposal generation methods, and our enhanced APN, BSN and CAR models all contribute to the performance improvement. Specifically, with the smoothed probability sequences, our improved-BSN can generate proposals with higher precision, and with three additional classifiers, our improved APN can evaluate proposals more reasonably. These improvements also point to directions for better temporal action proposal generation and localization.
References
- [1] N. Bodla, B. Singh, R. Chellappa, and L. S. Davis. Improving object detection with one line of code. arXiv preprint arXiv:1704.04503, 2017.
- [2] S. Buch, V. Escorcia, C. Shen, B. Ghanem, and J. C. Niebles. SST: Single-stream temporal action proposals. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6373–6382. IEEE, 2017.
- [3] V. Escorcia, F. C. Heilbron, J. C. Niebles, and B. Ghanem. DAPs: Deep action proposals for action understanding. In European Conference on Computer Vision, pages 768–784. Springer, 2016.
- [4] J. Gao, Z. Yang, C. Sun, K. Chen, and R. Nevatia. TURN TAP: Temporal unit regression network for temporal action proposals. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 3648–3656. IEEE, 2017.
- [5] T. Lin, X. Zhao, and Z. Shou. Single shot temporal action detection. In Proceedings of the 2017 ACM on Multimedia Conference, pages 988–996. ACM, 2017.
- [6] T. Lin, X. Zhao, and Z. Shou. Temporal convolution based action proposal: Submission to ActivityNet 2017. arXiv preprint arXiv:1707.06750, 2017.
- [7] T. Lin, X. Zhao, H. Su, C. Wang, and M. Yang. BSN: Boundary sensitive network for temporal action proposal generation. arXiv preprint arXiv:1806.02964, 2018.
- [8] Z. Shou, D. Wang, and S.-F. Chang. Temporal action localization in untrimmed videos via multi-stage CNNs. In CVPR, pages 1049–1058, 2016.
