RefineLoc: Iterative Refinement for Weakly-Supervised Action Localization

Alejandro Pardo; Humam Alwassel; Fabian Caba Heilbron; Ali Thabet; Bernard Ghanem

arXiv:1904.00227·cs.CV·November 10, 2020

TL;DR

RefineLoc introduces an iterative refinement method for weakly-supervised temporal action localization, leveraging pseudo ground truths to improve detection accuracy without requiring detailed annotations.

Contribution

The paper proposes a novel iterative refinement approach that enhances weakly-supervised action localization by training on pseudo ground truths, outperforming existing methods.

Findings

1. Achieves competitive results on the ActivityNet v1.2 and THUMOS14 datasets.
2. Significantly improves the performance of existing state-of-the-art methods.
3. Sets a new state of the art on the THUMOS14 dataset.

Abstract

Video action detectors are usually trained using datasets with fully-supervised temporal annotations. Building such datasets is an expensive task. To alleviate this problem, recent methods have tried to leverage weak labeling, where videos are untrimmed and only a video-level label is available. In this paper, we propose RefineLoc, a novel weakly-supervised temporal action localization method. RefineLoc uses an iterative refinement approach by estimating and training on snippet-level pseudo ground truth at every iteration. We show the benefit of this iterative approach and present an extensive analysis of five different pseudo ground truth generators. We show the effectiveness of our model on two standard action datasets, ActivityNet v1.2 and THUMOS14. RefineLoc shows competitive results with the state-of-the-art in weakly-supervised temporal localization. Additionally, our iterative…

Tables

Table 1. Effects of pseudo ground truth generator and loss trade-off coefficient $\beta$ on ActivityNet v1.2. The segment prediction-based generator with $\beta=4$ shows the highest performance (underlined). Bold numbers mark the best performing generator for each $\beta$.

Pseudo Ground Truth Generator	$\beta=0$ (baseline, shared by all generators)	$\beta=1$	$\beta=2$	$\beta=4$	$\beta=8$	$\beta=16$
Uniform Random	9.66	9.66	9.66	9.66	9.66	9.66
Distribution Aware		17.39	19.10	20.00	17.73	18.30
Class Activation		23.09	23.02	22.93	22.86	22.85
Attention		23.15	23.13	22.97	23.00	22.94
Segment Prediction		23.04	23.15	23.24	23.11	23.09

Table 2. Effects of refinement. We show the gain from our iterative refinement on ActivityNet v1.2. Note the significant improvement over iterations: $13.58\%$ in $3$ iterations.

Refinement Iteration	0	1	2	3	4	5
RefineLoc	9.66	19.14	22.66	23.24	22.94	22.95

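Table 3. (a) Methods using TSN features on ActivityNet v1.2 (mAP, %, at tIoU thresholds $0.5$, $0.75$, and $0.95$, plus the average over $0.5$:$0.05$:$0.95$)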

Method	0.5	0.75	0.95	Avg.
UntrimmedNets [62]	7.4	3.2	0.7	3.6
AutoLoc [53]	27.3	15.1	3.3	16.0
TSM [69]	28.3	17.0	3.5	17.1
CMCS [35]	33.9	19.9	5.1	20.5
CleanNet [36]	37.1	20.3	5.0	21.6
RefineLoc ( $η = 0$ )	25.8	11.5	2.8	13.3
RefineLoc ( $η = 5$ )	38.8	22.2	5.3	23.2

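Table 5. (b) Methods using I3D features on ActivityNet v1.2 (mAP, %, at tIoU thresholds $0.5$, $0.75$, and $0.95$, plus the average over $0.5$:$0.05$:$0.95$)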

Method	0.5	0.75	0.95	Avg.
W-TALC [46]	37.0	-	-	18.0
3C-Net [39]	35.4	-	-	21.1
3C-Net $†$ [39]	37.2	-	-	21.7
CMCS [35]	36.8	22.0	5.6	22.4
BaS-Net [33]	38.5	24.2	5.6	24.3
RefineLoc ( $η = 0$ )	19.2	8.0	2.3	9.7
RefineLoc ( $η = 3$ )	38.7	22.6	5.5	23.2

Table 6. (a) Methods using TSN features on THUMOS14 (mAP, %, at tIoU thresholds $0.3$ to $0.7$)

Method	0.3	0.4	0.5	0.6	0.7
UntrimmedNets [62]	28.2	21.1	13.7	-	-
W-TALC [46]	32.0	26.0	18.8	-	6.2
CMCS [35]	37.5	29.1	19.9	12.3	6.0
AutoLoc [53]	35.8	29.0	19.9	12.3	6.0
CleanNet [36]	37.5	29.1	23.9	13.9	7.1
BaS-Net [33]	42.8	34.7	25.1	17.1	9.3
RefineLoc ( $η = 0$ )	7.0	4.2	2.9	1.3	0.6
RefineLoc ( $η = 4$ )	36.1	29.6	22.6	12.1	5.8

Table 8. (b) Methods using I3D features on THUMOS14 (mAP, %, at tIoU thresholds $0.3$ to $0.7$)

Method	0.3	0.4	0.5	0.6	0.7
W-TALC [46]	40.1	31.1	22.8	-	7.6
CMCS [35]	41.2	32.1	23.1	15.0	7.0
TSM [69]	39.5	-	24.5	-	7.1
3C-Net [39]	40.9	32.3	24.6	-	7.7
3C-Net $†$ [39]	44.2	34.1	26.6	-	8.1
Nguyen et al. [42]	46.6	37.5	26.8	17.6	9.0
BaS-Net [33]	44.6	36.0	27.0	18.6	10.4
RefineLoc ( $η = 0$ )	34.8	27.7	19.5	10.7	4.6
RefineLoc ( $η = 14$ )	40.8	32.7	23.1	13.3	5.3

Table 9. (a) Generalizability of RefineLoc to other base models: W-TALC on THUMOS14 with I3D features (mAP, %, at tIoU thresholds $0.3$ to $0.7$)

Method	0.3	0.4	0.5	0.6	0.7
W-TALC Code [46]	42.98	34.59	26.99	17.74	9.42
W-TALC Code + RefineLoc	44.10	35.08	27.66	17.67	9.14

Table 11. (b) Generalizability of RefineLoc to other base models: BaS-Net on THUMOS14 with I3D features (mAP, %, at tIoU thresholds $0.3$ to $0.7$)

Method	0.3	0.4	0.5	0.6	0.7
BaS-Net Code [33]	43.40	35.16	26.26	18.59	10.16
BaS-Net Code + RefineLoc	45.10	36.50	28.03	18.95	10.36

Table 12. (a) ActivityNet v1.2 with TSN features

Pseudo Ground Truth Generator	$\beta=0$ (baseline, shared by all generators)	$\beta=1$	$\beta=2$	$\beta=4$	$\beta=8$	$\beta=16$
Uniform Random	13.15	13.15	13.15	13.15	13.15	13.15
Distribution Aware		15.27	19.22	18.76	20.80	20.96
Class Activation		22.95	22.90	22.53	22.55	22.23
Attention		23.15	22.90	22.47	22.57	22.36
Segment Prediction		23.09	23.16	22.98	23.02	22.88

Table 14. (b) THUMOS14 with I3D features

Pseudo Ground Truth Generator	$\beta=0$ (baseline, shared by all generators)	$\beta=1$	$\beta=2$	$\beta=4$	$\beta=8$	$\beta=16$
Uniform Random	19.45	21.12	20.20	19.78	19.45	19.45
Distribution Aware		20.69	20.32	19.45	19.45	19.45
Class Activation		20.18	20.11	20.10	20.21	20.34
Attention		19.45	19.45	19.45	19.45	19.45
Segment Prediction		21.48	22.60	21.55	20.85	21.09

Table 15. (c) THUMOS14 with TSN features

Pseudo Ground Truth Generator	$\beta=0$ (baseline, shared by all generators)	$\beta=1$	$\beta=2$	$\beta=4$	$\beta=8$	$\beta=16$
Uniform Random	2.90	17.97	17.64	18.60	18.96	16.64
Distribution Aware		14.89	14.73	14.90	16.42	14.50
Class Activation		12.12	12.32	13.66	12.98	13.28
Attention		20.70	21.37	21.10	20.64	19.66
Segment Prediction		20.92	21.87	22.63	21.13	20.64

Table 16. (a) ActivityNet v1.2 using TSN features

Refinement Iteration	0	1	2	3	4	5
RefineLoc	13.27	21.62	22.76	23.09	22.68	23.23

Table 18. (b) THUMOS14 using I3D features

Refinement Iteration	0	3	6	9	12	14
RefineLoc	19.45	20.96	21.36	22.46	21.87	23.12

Table 19. (c) THUMOS14 using TSN features

Refinement Iteration	0	1	2	3	4	5
RefineLoc	2.90	11.13	18.73	20.60	22.63	20.12

Equations

$$\bar{\textbf{A}}^{bf}_{t,i} = \frac{\exp(\textbf{A}_{t,i})}{\exp(\textbf{A}_{t,1}) + \exp(\textbf{A}_{t,2})},$$

$$\bar{\textbf{A}}^{time}_{t,i} = \frac{\exp(\bar{\textbf{A}}^{bf}_{t,i})}{\sum_{t'=1}^{T} \exp(\bar{\textbf{A}}^{bf}_{t',i})}.$$

$$s = \frac{1}{t_{2} - t_{1} + 1} \sum_{t=t_{1}}^{t_{2}} \left(\bar{\textbf{A}}^{time}_{t} + \bar{\textbf{C}}_{t,n}\right) + \hat{\textbf{y}}_{n}.$$

$$\text{loss} = \mathcal{L}(\hat{\textbf{y}}, \textbf{y}) + \beta\,\frac{1}{T} \sum_{t=1}^{T} \mathcal{L}\left(\bar{\textbf{A}}^{bf}_{t}, \mathcal{G}^{\mathcal{M}_{\eta}}(t)\right),$$

Code & Models

Repositories

HumamAlwassel/RefineLoc
PyTorch (official)

Full text

RefineLoc: Iterative Refinement for Weakly-Supervised Action Localization

Alejandro Pardo1,* Humam Alwassel1,* Fabian Caba Heilbron2 Ali Thabet1 Bernard Ghanem1

1King Abdullah University of Science and Technology (KAUST) 2Adobe Research

{alejandro.pardo,humam.alwassel,ali.thabet,bernard.ghanem}@kaust.edu.sa [email protected]

http://humamalwassel.com/publication/refineloc

* indicates equal contribution.

Abstract

Video action detectors are usually trained using datasets with fully-supervised temporal annotations. Building such datasets is an expensive task. To alleviate this problem, recent methods have tried to leverage weak labeling, where videos are untrimmed and only a video-level label is available. In this paper, we propose RefineLoc, a novel weakly-supervised temporal action localization method. RefineLoc uses an iterative refinement approach by estimating and training on snippet-level pseudo ground truth at every iteration. We show the benefit of this iterative approach and present an extensive analysis of five different pseudo ground truth generators. We show the effectiveness of our model on two standard action datasets, ActivityNet v1.2 and THUMOS14. RefineLoc shows competitive results with the state-of-the-art in weakly-supervised temporal localization. Additionally, our iterative refinement process is able to significantly improve the performance of two state-of-the-art methods, setting a new state-of-the-art on THUMOS14.

1 Introduction

Weak supervision has emerged as an effective way to train computer vision models using labels that are easy and cheap to acquire. This training strategy is particularly relevant for video tasks, where data collection and annotation costs are prohibitively expensive. In this paper, our goal is to localize actions in time when no information about the start and end times of these actions is available. The lack of temporal supervision makes it challenging to train models that discriminate between action and background segments. Recent methods for weakly-supervised temporal action localization focus on learning class activation maps using soft-attention [62], regularizing attention with an L1 loss [41], or leveraging co-activity and multiple instance learning losses [46]. Alternatively, other methods [53, 36] have focused on generating temporal boundaries using priors such as those encouraged by contrastive losses. All previous methods provide elegant strategies to localize actions in a weakly-supervised manner; however, they are all trained in a single shot and disregard all temporal cues. As a result, their performance lags far behind that of fully-supervised methods trained on temporal action annotations.

In the object detection domain, refining using pseudo ground truth considerably reduces the performance gap between fully and weakly-supervised object detectors [59, 71]. Such pseudo ground truth refers to a set of sampled object predictions from a weakly-supervised model, which are assumed as actual object locations in the next refinement iteration. However, these methods are not directly applicable to temporal action localization. We argue this is in part due to the lack of reliable unsupervised region proposals as in object detection.

In this paper, we propose RefineLoc, a weakly-supervised temporal localization method, which incorporates an iterative refinement strategy by leveraging pseudo ground truth. Figure 1 shows an example of the iterative refinement process RefineLoc employs via pseudo ground truth generation. Contrary to object detection methods, we build our refinement strategy to operate over snippet-level attention and classification modules, making it suitable for temporal localization.

The intuition behind our iterative refinement is to leverage a weakly-supervised model, which captures decent temporal cues about actions, to annotate snippets with pseudo foreground (action) and background (no action) labels. This pseudo ground truth is then used to train a snippet-level attention module in a supervised manner. Although such pseudo labels are noisy, it has been shown that neural networks are reasonably robust against such label perturbations [50]. To avoid bias towards learning from easy examples, we randomly sample a subset of snippets to supervise with the pseudo labels. Our study of multiple pseudo ground truth generators shows that our simple model is competitive with the state-of-the-art. Furthermore, our iterative refinement process is generic and can be applied on top of more sophisticated models to further improve their performance.

Contributions: We summarize our contributions as twofold. (1) We introduce RefineLoc, an iterative refinement model for weakly-supervised temporal action localization. The model is crafted to leverage snippet-level pseudo ground truth to improve its performance over training iterations. (2) We show that RefineLoc’s iterative refinement process improves the performance of two state-of-the-art methods, setting a new state-of-the-art on THUMOS14. To enable reproducibility and promote future research, we have released our source code and pretrained models on our project website.

2 Related Work

Action Recognition. The advent of action recognition datasets such as UCF-101 [58], Sports-1M [27], and Kinetics [28] has fueled the development of accurate action recognition models. Traditional approaches include extracting hand-crafted representations aimed at capturing spatiotemporal features [31, 61]; however, nowadays deep learning based approaches are more attractive due to their high capacity. For example, Simonyan and Zisserman [56] proposed to encode spatial and temporal information with Convolutional Neural Networks. Their two-stream model represents appearance with RGB frames and motion with stacked optical flow vectors. However, the two-stream model encodes each frame independently, neglecting mid-level temporal information. To overcome this drawback, Wang et al. introduced the Temporal Segment Network (TSN) [63], an end-to-end framework that captures long-term temporal information. TSN, along with other recent architectures (e.g. I3D [10] and C3D [60]), has become the de facto backbone for temporal action localization [47], action segmentation [19], and event captioning [65].

Fully-supervised Temporal Action Localization. Multiple strategies have been developed for temporal action localization with full-supervision available at training time [2, 14, 20, 22, 52, 68]. The first set of approaches used sliding windows combined with complex activity classifiers to detect actions [18, 43]. These methods paved the way for this type of research and established baselines and a reference for the difficulty of the problem. However, they manifested limitations regarding their run-time complexity. The second generation of methods used action proposals to speed up the search process [5, 6, 21, 34, 54]. These temporal proposals aim to narrow down the number of candidate segments the action classifier examines. A third generation of approaches learn action proposals and action classifiers jointly, while back-propagating through the video representation backbone [12, 64, 72]. Finally, recent methods make use of Graph Convolutional Networks by representing videos as graphs [70, 67]. Despite their significant performance improvements, all of these methods still rely on strong supervision that is prohibitively expensive to acquire.

Weakly-supervised Temporal Action Localization. The challenge in this task is to learn to discriminate between background and action segments without having explicit temporal training samples, but instead, only a coarse video-level label. The first methods proposed solutions consisting of hiding video regions to encourage their model to discover discriminative parts [30], and a soft-attention layer to focus on snippets that boost the video classification performance [62]. Similarly, [41] proposed an attention layer regularized with an L1 loss. Other works explored different alternatives such as co-activity loss combined with a multiple instance learning loss [46] and action proposal generation using contrast cues among action classification predictions [53, 36]. With the end goal of addressing the lack of temporal information, other works have innovated strategies such as incorporating temporal structure [69], modeling background [42, 33], using extra supervision (e.g. action count [39]), or single-frame label [37]. More recent methods have tried to reduce the supervision level by using self-supervised techniques [26]. Our work builds upon these ideas and complements them with a key insight: leveraging pseudo labels while iteratively training the model.

Weak Supervision and Pseudo-labeling in Vision Tasks. Weak supervision has been widely studied in other vision tasks such as object detection [3, 44, 51, 57], semantic segmentation [45, 66], or other video tasks [17, 23, 24, 48]. For video tasks, a variety of weak supervision cues have been used including movie scripts [32, 16, 29, 38], action ordering priors [4, 49, 15, 11], and different levels of supervision [13]. These video related solutions have proposed innovative ways to reduce labeling expense; however, they still require laborious annotations (e.g. action spots) or privileged information (e.g. transcripts) that is difficult to obtain beyond a controlled setting. Concerning pseudo-labeling, it has been used to design state-of-the-art methods for weakly-supervised object detection [59, 71], train image classification backbones [8, 9], and build pose detectors [40]. These works have inspired our model, which addresses challenges unique to the weakly supervised temporal action localization task, namely the presence of only a sparse supervision signal (video-level action category) and of highly similar context surrounding the action [1].

3 RefineLoc

In this section, we discuss our RefineLoc architecture, the pseudo ground truth label generation, and the iterative refinement process. The input to our model is an untrimmed video and the expected output is a set of action segment predictions. RefineLoc is supervised on weak labels (i.e. video-level action labels) and does not use any temporal annotations of action instances. RefineLoc has two main components: a weakly-supervised temporal action localization (WSTAL) base model (Subsection 3.1) and an iterative refinement process (Subsection 3.2). Based on a trained WSTAL model, we generate pseudo background-foreground ground truth labels. We use these pseudo labels to supervise the training of a new WSTAL model. We repeat the process for $\eta$ iterations to progressively improve the pseudo ground truth and refine the final action prediction segments. Figure 2 illustrates our approach.
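The refinement loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: `train` and `generate_pseudo_labels` are hypothetical callables standing in for WSTAL training and for the pseudo ground truth generator, and the function name `refine` is ours.

```python
from typing import Any, Callable, List, Optional


def refine(
    train: Callable[[Optional[Any]], Any],
    generate_pseudo_labels: Callable[[Any], Any],
    num_iterations: int,
) -> List[Any]:
    """Return the list of models [M_0, M_1, ..., M_eta], eta = num_iterations.

    Assumptions (hypothetical interface):
    - train(None) trains a model on video-level labels only (base model M_0);
    - train(pseudo) trains a new model that also uses snippet-level pseudo labels;
    - generate_pseudo_labels(model) maps a trained model to pseudo labels.
    """
    models = [train(None)]  # M_0: weak video-level supervision only
    for _ in range(num_iterations):
        pseudo = generate_pseudo_labels(models[-1])  # labels produced by the latest model
        models.append(train(pseudo))  # a new model is trained at every iteration
    return models
```

With toy stand-ins such as `train = lambda p: 0 if p is None else p + 1` and `generate_pseudo_labels = lambda m: m`, the call `refine(train, generate_pseudo_labels, 3)` returns one entry per iteration, which makes it easy to pick the best iteration afterwards.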

3.1 WSTAL Base Model

The input to WSTAL is an untrimmed video, while the output is temporal action segment predictions. First, WSTAL extracts features from $T$ non-overlapping snippets, which are then fed into both a snippet-level action classifier and a background-foreground attention module. Then, WSTAL combines the class activation and attention maps to produce a video label prediction $\hat{\textbf{y}}$. During training, we supervise WSTAL with a cross-entropy loss between the ground truth video label $\textbf{y}$ and the predicted label $\hat{\textbf{y}}$. Finally, we post-process the learned class activation and attention maps to produce action segment predictions. In what follows, we discuss the details of each module in WSTAL.

Feature Extraction Module. To compare with other works, we use two feature extractor backbones: TSN [63] (pretrained by UntrimmedNets [62]) and I3D [10] (pretrained on Kinetics [28]). We split the input untrimmed video into $T$ non-overlapping $H$ -frame-long clip snippets ( $15$ for TSN and $16$ for I3D). We transform each snippet into a $2048$ -dimensional feature vector by concatenating the two $1024$ -dimensional activation vectors from the global pooling layer of each stream. Thus, this module outputs a $T\times 2048$ feature map F.

Snippet-Level Classification Module. This module receives the feature map F and produces a $T\times N$ class activation map C, where $N$ is the number of action classes ($100$ classes in ActivityNet v1.2 [7] and $20$ in THUMOS14 [25]). It consists of a multi-layer perceptron (MLP) with $L$ Fully-Connected (FC) layers interleaved with ReLU activation functions. We reduce the size of each hidden layer by a factor of $2$, which makes the last layer of size $\frac{2048}{2^{L-1}}\times N$.
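To make the layer sizes concrete, the short sketch below lists the input and output width of each FC layer when every hidden layer halves the width. The helper name `mlp_layer_shapes` is hypothetical and only illustrates the sizing rule stated above.

```python
from typing import List, Tuple


def mlp_layer_shapes(in_dim: int, out_dim: int, num_layers: int) -> List[Tuple[int, int]]:
    """Return (input width, output width) for each of the L FC layers.

    Every hidden layer halves the width, so the last layer has input width
    in_dim / 2^(L-1) and output width out_dim.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    shapes = []
    width = in_dim
    for _ in range(num_layers - 1):
        shapes.append((width, width // 2))  # hidden layer: halve the width
        width //= 2
    shapes.append((width, out_dim))  # final layer maps to the output size
    return shapes
```

For example, `mlp_layer_shapes(2048, 100, 2)` gives a $2048\times 1024$ hidden layer followed by a $1024\times 100$ output layer, matching $\frac{2048}{2^{L-1}}\times N$ with $L=2$ and $N=100$.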

Background-Foreground Attention Module. The objective of this module is to learn attention weights for each snippet to suppress the background snippets and to focus on foreground snippets. It transforms F into a $T\times 2$ background-foreground attention map A. Similar to the Snippet-Level Classification Module, it consists of an MLP with $L$ FC layers interleaved with ReLUs. Each hidden layer size is reduced by half, making the last FC layer of size $\frac{2048}{2^{L-1}}\times 2$ . Other weakly-supervised action localization methods [41, 42, 36, 53, 62, 39] employ attention modules in their models. While we share a similar motivation, our attention module is different from theirs in one key aspect: their attention modules are only supervised by the video-level label for the purpose of improving the video classification, while our attention is supervised by both the video-level label and a set of pseudo background-foreground labels with the goal of improving the action segment localization. Subsection 3.2 details the pseudo ground truth label generation process. Unlike previous methods with a scalar attention, we model the attention explicitly with two values, one for foreground and one for background. We chose to do so because our method uses supervision directly on the attention values. Thus, instead of learning the attention with a logistic-regression loss, we learn it as a binary classification problem. We compare learning a scalar attention via logistic regression against our proposed two dimensional attention in the supplementary material.

Video Label Prediction Module. This module combines C and A to generate an $N$-dimensional probability vector $\hat{\textbf{y}}$ for the video label. Specifically, we pass C through a softmax layer across the class dimension to get $\bar{\textbf{C}}$ and pass A through two softmax layers. The first softmax layer operates across the background-foreground dimension to produce $\bar{\textbf{A}}^{bf}$, while the second softmax layer operates across the time dimension (across snippets) of the foreground attentions in $\bar{\textbf{A}}^{bf}$ to produce $\bar{\textbf{A}}^{time}$ as follows:

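$$\bar{\textbf{A}}^{bf}_{t,i} = \frac{\exp(\textbf{A}_{t,i})}{\exp(\textbf{A}_{t,1}) + \exp(\textbf{A}_{t,2})}, \tag{1}$$

$$\bar{\textbf{A}}^{time}_{t,i} = \frac{\exp(\bar{\textbf{A}}^{bf}_{t,i})}{\sum_{t'=1}^{T} \exp(\bar{\textbf{A}}^{bf}_{t',i})}. \tag{2}$$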

Here, we use $\bar{\textbf{A}}^{bf}$ as the network’s predictions for the snippet-level background-foreground pseudo ground truth supervision (Subsection 3.2). Note that in Equations 1 and 2, $i=1$ refers to background while $i=2$ refers to foreground. Finally, this module computes the video-label prediction as $\hat{\textbf{y}}=\sum_{t=1}^{T}(\bar{\textbf{A}}^{time}_{t}\cdot\bar{\textbf{C}}_{t})$ , where $\bar{\textbf{A}}_{t}^{time}$ and $\bar{\textbf{C}}_{t}$ are the foreground attention value and class activation vector of the $t^{\text{th}}$ snippet. The video-label prediction uses a soft attention mechanism to emphasize the class activations of snippets with higher attention values.
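The computation of this module can be sketched in plain Python as below. It is an illustrative sketch rather than the released code: the function names are ours, inputs are nested lists, and column index 0 is assumed to hold the background logit and index 1 the foreground logit (the paper's $i=1$ and $i=2$).

```python
import math
from typing import List, Sequence, Tuple


def softmax(values: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a 1-D sequence."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def video_label_prediction(
    C: Sequence[Sequence[float]], A: Sequence[Sequence[float]]
) -> Tuple[List[float], List[List[float]], List[float], List[List[float]]]:
    """C is a T x N class activation map, A is a T x 2 attention map (bg, fg).

    Returns (y_hat, A_bf, A_time, C_bar).
    """
    C_bar = [softmax(row) for row in C]          # softmax across classes
    A_bf = [softmax(row) for row in A]           # softmax across background/foreground
    A_time = softmax([row[1] for row in A_bf])   # softmax across time of foreground values
    num_classes = len(C[0])
    # Attention-weighted sum of the per-snippet class probabilities.
    y_hat = [
        sum(A_time[t] * C_bar[t][n] for t in range(len(C)))
        for n in range(num_classes)
    ]
    return y_hat, A_bf, A_time, C_bar
```

Because $\bar{\textbf{A}}^{time}$ sums to one over snippets and each row of $\bar{\textbf{C}}$ sums to one over classes, the resulting $\hat{\textbf{y}}$ is itself a probability vector.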

Action Segment Prediction Module. This module post-processes $\bar{\textbf{A}}^{bf}$ and $\bar{\textbf{C}}$ to produce a set of action segment predictions $\mathcal{P}$. First, we filter out snippets for which the background attention value is greater than a threshold $\alpha_{A}$. Then, we consider only the top-$k$ classes in $\hat{\textbf{y}}$. For each top class $n$, we filter out snippets that have a classification score lower than a threshold $\alpha_{C}$. Then, we generate contiguous segments by grouping snippets that are separated by at most one filtered-out (background) snippet. We do so to overcome noise in the filtering process and connect segments that are close to each other. This process can be done in other, more sophisticated ways; however, we keep the base model simple and rely mainly on our iterative process. We assign to each predicted segment ($t_{1}$, $t_{2}$) the label $n$ and the score $s$,

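$$s = \frac{1}{t_{2} - t_{1} + 1} \sum_{t=t_{1}}^{t_{2}} \left(\bar{\textbf{A}}^{time}_{t} + \bar{\textbf{C}}_{t,n}\right) + \hat{\textbf{y}}_{n}, \tag{3}$$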

where $\hat{\textbf{y}}_{n}$ is the video-level prediction score for the $n^{\text{th}}$ class. Note that each prediction coming from the $n^{\text{th}}$ of the top-$k$ labels has its own score $s$. Finally, to encode temporal context and deal with the ambiguity of action boundaries [1, 55], we inflate segments by $2$ snippets at both ends.
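The post-processing steps for a single class $n$ can be sketched as follows. This is a hedged illustration, not the released implementation: the function names are ours, segments are inclusive snippet index pairs, index 0 of each attention row is assumed to be background, the score is assumed to be computed before inflation, and inflated segments are assumed to be clipped to the video extent.

```python
from typing import List, Sequence, Tuple


def group_kept_snippets(keep: Sequence[bool], max_gap: int = 1) -> List[Tuple[int, int]]:
    """Group kept snippet indices into inclusive (t1, t2) segments.

    Kept snippets separated by at most `max_gap` filtered-out snippets are merged.
    """
    segments = []
    start = prev = None
    for t, kept in enumerate(keep):
        if not kept:
            continue
        if start is None:
            start = prev = t
        elif t - prev - 1 <= max_gap:
            prev = t  # gap is small enough: extend the current segment
        else:
            segments.append((start, prev))
            start = prev = t
    if start is not None:
        segments.append((start, prev))
    return segments


def predict_segments(A_bf, C_bar, A_time, y_hat, n, alpha_A, alpha_C, inflate=2):
    """Return (t1, t2, class, score) tuples for class n."""
    T = len(A_bf)
    # Keep a snippet unless its background attention exceeds alpha_A
    # or its score for class n falls below alpha_C.
    keep = [A_bf[t][0] <= alpha_A and C_bar[t][n] >= alpha_C for t in range(T)]
    predictions = []
    for t1, t2 in group_kept_snippets(keep):
        mean = sum(A_time[t] + C_bar[t][n] for t in range(t1, t2 + 1)) / (t2 - t1 + 1)
        score = mean + y_hat[n]  # Equation 3
        # Inflate at both ends, clipped to valid snippet indices (assumption).
        predictions.append((max(0, t1 - inflate), min(T - 1, t2 + inflate), n, score))
    return predictions
```

In a full pipeline this function would be called once for each of the top-$k$ classes in $\hat{\textbf{y}}$, and the resulting lists concatenated to form $\mathcal{P}$.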

3.2 Iterative Refinement Process

Let $\mathcal{M}_{0}$ be the WSTAL base model trained using the weak video labels only. We iteratively refine this base model and its action predictions by introducing supervision on the background-foreground attention module using snippet-level pseudo ground truth labels. Let $\mathcal{G}^{\mathcal{M}_{\eta}}$ be the pseudo ground truth generation function that uses information from $\mathcal{M}_{\eta}$ (the trained WSTAL base model after iteration $\eta$ ) to map each snippet to a pseudo background-foreground label. At iteration $\eta+1$ , we train a new WSTAL base model $\mathcal{M}_{\eta+1}$ on the joint loss of the video-level label and the snippet-level pseudo ground truth labels from $\mathcal{G}^{\mathcal{M}_{\eta}}$ . Specifically, we compute the loss for $\mathcal{M}_{\eta+1}$ on a given video in the following way,

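$$\text{loss} = \mathcal{L}(\hat{\textbf{y}}, \textbf{y}) + \beta\,\frac{1}{T} \sum_{t=1}^{T} \mathcal{L}\left(\bar{\textbf{A}}^{bf}_{t}, \mathcal{G}^{\mathcal{M}_{\eta}}(t)\right), \tag{4}$$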

where $\mathcal{L}$ is the cross-entropy loss and $\beta$ is a trade-off coefficient to balance the loss signal of the pseudo ground truth with that of the video label. Note that the second cross-entropy loss is class-weighted to alleviate the imbalance in background and foreground pseudo labels.
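A minimal sketch of this joint loss for a single video is given below. Several details are assumptions made for illustration: the video label is a single class index, pseudo labels use 0 for background and 1 for foreground, and `class_weights` is a hypothetical pair of per-class weights (the text states the loss is class-weighted but does not give the weighting scheme).

```python
import math
from typing import Sequence


def cross_entropy(probs: Sequence[float], target: int) -> float:
    """Cross-entropy of a probability vector against a single target index."""
    return -math.log(max(probs[target], 1e-12))


def refineloc_loss(
    y_hat: Sequence[float],
    y: int,
    A_bf: Sequence[Sequence[float]],
    pseudo_labels: Sequence[int],
    beta: float,
    class_weights: Sequence[float] = (1.0, 1.0),
) -> float:
    """Video-level loss plus beta times the mean snippet-level pseudo label loss."""
    video_term = cross_entropy(y_hat, y)
    T = len(A_bf)
    snippet_term = sum(
        class_weights[g] * cross_entropy(A_bf[t], g)  # weighted by the pseudo class
        for t, g in enumerate(pseudo_labels)
    ) / T
    return video_term + beta * snippet_term
```

Setting `beta=0` recovers the base model objective, which is why the $\beta=0$ column of the ablation tables corresponds to $\mathcal{M}_{0}$.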

Pseudo Ground Truth Generation. Intuitively, to obtain the maximum gain from the iterative refinement process, we want a pseudo ground truth generator that provides the closest approximation to the true snippet-level background-foreground ground truth labels, i.e. it should minimize the mislabelling rate. In order to overcome any possible bias learned by the pseudo ground truth generator and inspired by [30], we only fixate on a portion of the pseudo ground truth in a process we call pseudo ground truth sampling: at the start of each refinement iteration, we randomly sample a percentage $S$ of snippets for which we apply the pseudo ground truth loss. We consider five different pseudo ground truth generation strategies and study their effects on the localization performance (Subsection 4.3).

(1) Uniformly Random Generator: This generator assigns a uniformly random pseudo label to each snippet.

(2) Distribution Aware Generator: This generator gives, with a biased probability, a random pseudo ground truth label to each snippet. The biased probability is equal to the average ratio of actual foreground to background snippets. This generator relies on information (namely the ratio) that requires access to strong temporal annotations. Thus, it does not align with the weakly-supervised setting, but we include it as a baseline reference only.

(3) Class Activation-Based Generator: This generator selects the pseudo ground truth label for a snippet $t$ by thresholding its maximum class score, $\max(\bar{\textbf{C}}_{t})$ .

(4) Attention-Based Generator: This generator produces the pseudo ground truth label for a snippet $t$ by thresholding its foreground attention value, $\bar{\textbf{A}}^{time}_{t}$ .

(5) Segment Prediction-Based Generator: This generator assigns pseudo labels based on the set of prediction segments $\mathcal{P}$ . A snippet is given a pseudo foreground label if it is covered by a segment prediction and a pseudo background label otherwise. We use this generator in our final model due to its attractive performance gain.
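The weakly-supervised generators and the pseudo ground truth sampling step can be sketched as below. This is illustrative only: all function names are ours, labels use 1 for foreground and 0 for background, the thresholds are hypothetical parameters (their values are not given here), and segments are inclusive snippet index pairs. The distribution-aware generator is omitted because it needs the true foreground ratio.

```python
import random
from typing import List, Sequence, Tuple


def uniform_random_generator(T: int, rng: random.Random) -> List[int]:
    """(1) Assign a uniformly random 0/1 label to each of the T snippets."""
    return [rng.randint(0, 1) for _ in range(T)]


def class_activation_generator(C_bar: Sequence[Sequence[float]], threshold: float) -> List[int]:
    """(3) Foreground if the maximum class score reaches the threshold."""
    return [1 if max(row) >= threshold else 0 for row in C_bar]


def attention_generator(A_time: Sequence[float], threshold: float) -> List[int]:
    """(4) Foreground if the foreground attention value reaches the threshold."""
    return [1 if a >= threshold else 0 for a in A_time]


def segment_prediction_generator(T: int, segments: Sequence[Tuple[int, int]]) -> List[int]:
    """(5) Foreground if the snippet is covered by any predicted segment."""
    labels = [0] * T
    for t1, t2 in segments:
        for t in range(max(0, t1), min(T - 1, t2) + 1):
            labels[t] = 1
    return labels


def sample_supervised_snippets(T: int, fraction: float, rng: random.Random) -> List[int]:
    """Pick the snippet indices that receive the pseudo ground truth loss."""
    k = round(fraction * T)
    return sorted(rng.sample(range(T), k))
```

The sampled indices would then restrict the snippet-level loss term to the chosen subset, while the video-level loss is always applied.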

4 Experiments

4.1 Datasets and Evaluation Metric

We conduct our experiments on ActivityNet v1.2 [7] and THUMOS14 [25]. Both datasets consist of untrimmed videos with (weak) video-level action labels and have (strong) temporal annotations of action instances. However, we discard the strong annotations during training.

THUMOS14 [25]. This dataset has $1010$ validation and $1574$ testing videos annotated with $101$ sport-related action classes at the video-level. Among these videos, only $200$ validation and $213$ testing videos have temporal annotations for $20$ sport actions. As in prior work [20, 72], we only consider these $20$ classes, use the $200$ validation videos to train, and use the $213$ testing videos to evaluate performance.

ActivityNet v1.2 [7]. This dataset has $9682$ untrimmed videos annotated with $100$ activity classes. It is split into training, validation, and testing subsets, where the testing subset labels are withheld for an annual challenge. Following other methods [46, 53], we use the training subset ( $4819$ videos) to train and the validation subset ( $2383$ videos) to test the performance. ActivityNet is a challenging dataset due to its large-scale nature and, unlike THUMOS14, its diverse classes ranging from household activities to sports.

Evaluation Metric. We compare methods according to mean Average Precision (mAP). We report mAP at multiple temporal Intersection-over-Union (tIoU) thresholds. We take the average mAP across tIoU thresholds $0.5$ : $0.05$ : $0.95$ as the main metric for ActivityNet v1.2 and the mAP at tIoU threshold $0.5$ as the evaluation metric for THUMOS14.
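For reference, the temporal IoU between a predicted segment and a ground truth segment, and the threshold grid $0.5$:$0.05$:$0.95$, can be computed as in this small sketch. The names are ours, segments are (start, end) pairs in seconds, and the full mAP computation (matching and precision-recall integration) is deliberately left out.

```python
from typing import List, Sequence, Tuple


def temporal_iou(pred: Tuple[float, float], gt: Tuple[float, float]) -> float:
    """Intersection-over-union of two temporal segments given as (start, end)."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


# tIoU thresholds 0.5:0.05:0.95 used for the ActivityNet v1.2 average mAP.
ACTIVITYNET_TIOU_THRESHOLDS: List[float] = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def average_map(map_per_threshold: Sequence[float]) -> float:
    """Average of per-threshold mAP values (the main ActivityNet metric)."""
    return sum(map_per_threshold) / len(map_per_threshold)
```

A prediction counts as a true positive at a given threshold only if its tIoU with an unmatched ground truth instance of the same class is at least that threshold.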

4.2 Implementation Details

We extract features from two different architectures: an I3D model [10] and the same pre-trained TSN [63] model used in AutoLoc [53], with $16$ and $15$ frames per snippet ($H$), respectively. We choose $L=2$ layers for the snippet-level classification and background-foreground attention modules. In the action segment prediction module, we set $(\alpha_{A},\alpha_{C})$ to $(0.5,0.005)$ for ActivityNet and $(0.5,0.35)$ for THUMOS14. We consider the top-$2$ labels when generating segment predictions in both datasets. At every iteration, we randomly sample $S=80\%$ of the pseudo labels. Finally, we use an initial learning rate of $10^{-4}$ for ActivityNet and $10^{-3}$ for THUMOS14, and decay the learning rate by $0.9$ when the validation loss saturates. We train for $50$ epochs per refinement iteration and pick the model with the lowest validation loss from Equation 4.

4.3 Ablation Study

In this subsection, we present multiple ablation studies motivating the design choices for our RefineLoc approach. First, we study the performance of several pseudo ground truth generators and the influence of the loss trade-off coefficient $\beta$ (Equation 4) on the performance of each generator. Afterwards, we analyze how our model’s performance changes from one refinement iteration to the next. Finally, we present a diagnosis study (using the DETAD [1] diagnostic tool) of the detection results before and after our iterative refinement process. We present all the studies in this subsection using the ActivityNet v1.2 [7] dataset along with I3D features. For all the experiments in this section, we report average mAP at tIoU thresholds $0.5$:$0.05$:$0.95$. Refer to the supplementary material for the study results on ActivityNet v1.2 using TSN features as well as on THUMOS14 [25] using I3D and TSN features.

Effects of the Pseudo Ground Truth Generator and the Loss Trade-off Coefficient $\beta$ . Table 1 summarizes the best average mAP performance using the five generators with five $\beta$ values. The baseline model $\mathcal{M}_{0}$ ( $\beta=0$ ) achieves $9.66\%$ average mAP at tIoU= $0.5$ : $0.05$ : $0.95$ . We observe a performance improvement over $\mathcal{M}_{0}$ across all generator types and $\beta$ values. This shows the effectiveness of our iterative refinement process. Moreover, we observe that the segment prediction-based generator is the best among the five generators. We hypothesize that this generator is better, since it has access to information from both the class activation and attention maps. Moreover, $\beta=4$ strikes the best balance between the video label loss and the background-foreground pseudo ground truth loss. We observe similar results on THUMOS14: the best generator is the segment prediction-based one and the best $\beta$ is $4$ .

Performance over Refinement Iterations. Table 2 shows the evolution of RefineLoc’s performance across five refinement iterations. We obtain the highest performance (average mAP of $23.24\%$ ) after $\eta=3$ iterations. This is a significant $13.64\%$ increase over our baseline model $\mathcal{M}_{0}$ (iteration $0$ in the table). We also see that refining $\mathcal{M}_{0}$ for a single iteration boosts the performance by $9.48\%$ . This clearly shows the effectiveness of leveraging the pseudo ground truth labels during training. We observe similar results on THUMOS14: the best performance is achieved after $\eta=3$ refinement iterations.
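The iteration study can be read as the following loop, given here as a rough sketch under stated assumptions. The callables `train_fn`, `generate_fn`, and `evaluate_fn` are hypothetical placeholders for training a model, producing snippet-level pseudo ground truth from it, and scoring it. Whether the model is re-initialized or fine-tuned at each iteration is not specified in this section, so that choice is left to `train_fn`.

```python
def iterative_refinement(train_fn, generate_fn, evaluate_fn, num_iterations):
    """Train, score, and regenerate pseudo ground truth for each iteration.

    Iteration 0 trains without pseudo ground truth (the baseline M_0).
    Returns the best iteration index, its score, and the full history.
    """
    pseudo_gt = None  # no pseudo ground truth is available at iteration 0
    history = []
    for eta in range(num_iterations + 1):
        model = train_fn(pseudo_gt)      # hypothetical: fit using pseudo_gt
        score = evaluate_fn(model)       # hypothetical: e.g. average mAP
        history.append((eta, score))
        pseudo_gt = generate_fn(model)   # labels for the next iteration
    best_eta, best_score = max(history, key=lambda item: item[1])
    return best_eta, best_score, history
```

Selecting the iteration with the highest score mirrors how the best $\eta$ is reported in the table; ties resolve to the earliest iteration.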

4.4 State-of-the-Art Comparison and Generalizability

On ActivityNet v1.2 [7] (Table 3). RefineLoc with TSN features outperforms the state-of-the-art, CleanNet [36], by $1.6\%$ in average mAP (Table 3(a)), while RefineLoc with I3D features shows performance competitive with BaS-Net [33] (Table 3(b)). ActivityNet is large-scale and contains more diverse classes compared to THUMOS14. Thus, RefineLoc’s strong performance on ActivityNet shows the effectiveness of our iterative refinement approach. We observe that our refinement process significantly enhances our base model, i.e. RefineLoc ( $\eta=0$ ), by $9.9\%$ (TSN) and $13.5\%$ (I3D) in average mAP.

On THUMOS14 [25] (Table 4). RefineLoc with TSN features (Table 4(a)) and with I3D features (Table 4(b)) exhibits performance competitive with state-of-the-art methods [33, 36, 42]. We observe that our refinement process significantly enhances our base model, i.e. RefineLoc ( $\eta=0$ ), by $19.7\%$ (TSN) and $3.6\%$ (I3D) in mAP@tIoU $=0.5$ .

On Generalizability (Table 5). We chose our WSTAL-base model to be simple, compared to other state-of-the-art models, to highlight the main contribution of our work, i.e. the iterative refinement process. This process can lift the performance of such a simple model to compete with and even outperform state-of-the-art methods on both datasets. Moreover, the effectiveness of the refinement process is independent of the WSTAL-base model, which we demonstrate by generalizing our framework to other base models, namely W-TALC [46] and BaS-Net [33], on THUMOS14 using I3D features. These two methods employ attention-based models, to which we apply our pseudo background-foreground ground truth refinement process. Table 5 compares the results from the released code of the two methods vs. their performance after adding our iterative refinement process. By doing this, we significantly improve both base methods. In fact, BaS-Net is improved by $1.77\%$ in mAP@tIoU $=0.5$ , setting a new state-of-the-art performance on THUMOS14 ( $28.03\%$ mAP@tIoU $=0.5$ ). It is important to note that the numbers obtained with the released code differ from the ones reported in [46, 33].

We show that our method is simple, yet effective. We demonstrate that the key component of RefineLoc is the iterative process, showing its effectiveness regardless of dataset, features, or base model. Despite its simplicity, RefineLoc outperforms all other methods using TSN features on ActivityNet, and beats the state-of-the-art when using BaS-Net and W-TALC as base models on THUMOS14.

4.5 Error Analysis and Qualitative Results

Diagnosing Detection Results. To analyze the merits of the proposed refinement strategy, we conduct a DETAD [1] false-positive analysis of RefineLoc at refinement iterations 0 and 3. We present the results in Figure 3. The false-positive profile analysis provides a fine-grained categorization of false-positive errors and summarizes the distribution of these errors over the top $5\textit{G}$ model predictions, where G is the number of ground truth segments in the dataset. After refinement (right plot), we observe that RefineLoc generates more high-scoring true positive predictions (towards $1\textit{G}$ ). Despite the reduction of background and localization errors, there is an increase in confusion errors. We attribute this increase to the simplicity of our initial classification module. Moreover, the extra supervision generated by the pseudo ground truth encourages the model to improve the localization but not directly the label prediction.
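The selection of the top $5\textit{G}$ predictions described above can be sketched as follows. This follows the description in the text (G counted over the dataset) and is not the DETAD tool's actual code; the function names and the `(start, end, score)` tuple layout are assumptions made for illustration.

```python
def top_kg_predictions(predictions, num_gt_segments, k=5):
    """Keep the k*G highest-scoring predictions.

    predictions:     list of (start, end, score) tuples (assumed layout).
    num_gt_segments: G, the number of ground truth segments.
    """
    ranked = sorted(predictions, key=lambda p: p[2], reverse=True)
    return ranked[: k * num_gt_segments]


def split_into_g_buckets(ranked, num_gt_segments):
    """Split ranked predictions into consecutive buckets of G (1G, 2G, ...)."""
    return [
        ranked[i:i + num_gt_segments]
        for i in range(0, len(ranked), num_gt_segments)
    ]
```

The first bucket (towards $1\textit{G}$) holds the highest-scoring predictions, which is where an increase in true positives is most visible.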

Qualitative Results. Figure 4 shows some RefineLoc qualitative detection results on ActivityNet. We present results for three different videos across different refinement iterations. The top video shows that our method not only enhances its coverage over iterations, but is also able to detect a new instance at iteration $1$ that was missed in the previous iteration. In the middle video, we see how RefineLoc manages to successfully merge different predictions over iterations. We also see erroneous predictions being cut off from iteration to iteration. The final example shows a failure case. Despite starting with decent predictions at iteration $0$, our predictions diverge drastically in subsequent steps.

5 Conclusion

We have presented RefineLoc, a novel weakly-supervised temporal action localization method. RefineLoc uses an iterative refinement strategy, where snippet-level pseudo labels are generated and used at every training iteration. Our experiments have shown that RefineLoc is competitive with the state-of-the-art and that our general iterative refinement process boosts the results of other methods, outperforming the state-of-the-art. This suggests that it could be used as an off-the-shelf strategy to refine the results of future weakly-supervised methods for temporal action localization. As labeling videos for action localization is a massive time and cost bottleneck, RefineLoc takes a step closer to alleviating the need for these prohibitively expensive tasks.

Acknowledgments. This work is supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) under Award No. OSR-CRG2017-3405.

Supplementary Material

Appendix A Additional Ablation Study

Here, we include the same ablation study presented in the main paper (Subsection 4.3) for three additional settings: ActivityNet v1.2 [7] using TSN [63] features and THUMOS14 [25] using TSN and I3D [10] features.

Effects of the Pseudo Ground Truth Generator and the Loss Trade-off Coefficient $\beta$ . Tables 6(a), 6(c), and 6(b) summarize the best performance for the five generators and for five different $\beta$ values on ActivityNet v1.2 using TSN, THUMOS14 using I3D, and THUMOS14 using TSN, respectively. The Segment Prediction-Based Generator consistently gives the best performance gain compared to the other generators in all settings.

Performance over Refinement Iterations. Tables 7(a), 7(b), and 7(c) show the evolution of RefineLoc’s performance across refinement iterations on ActivityNet v1.2 using TSN, THUMOS14 using I3D, and THUMOS14 using TSN. In each setting, we consistently observe a significant performance increase over our baseline model $\mathcal{M}_{0}$ (iteration $0$ in each table).

Diagnosing Detection Results. To further analyze the merits of the proposed refinement strategy, we conduct a DETAD [1] false-positive analysis of RefineLoc on ActivityNet v1.2 and THUMOS14 using I3D and TSN (Figures 5(a), 5(b), 5(d), and 5(c)). The false-positive profile analysis provides a fine-grained categorization of false-positive errors and summarizes the distribution of these errors over the top $5\textit{G}$ model predictions, where G is the number of ground truth segments in the dataset. After refinement (right plot in each figure), we observe that RefineLoc generates more high-scoring true positive predictions (towards $1\textit{G}$ ) and reduces background and localization errors. The DETAD results indicate that our iterative refinement encourages tighter temporal predictions, which we argue occurs primarily because of the snippet-level supervision injected in the form of pseudo ground truth.

Appendix B Logistic Regression vs Cross-Entropy

RefineLoc learns two values for the attention, instead of learning one single scalar. The motivation behind this design choice is to learn explicitly one value for background attention and one value for foreground attention. Besides, learning these two values through a classification loss (i.e. cross-entropy) is an easier problem than learning one value through a regression loss (i.e. logistic regression). For ActivityNet, we found that our initial hypothesis holds. Indeed, when we learn only one scalar for attention, RefineLoc obtains only $22.2\%$ average mAP using I3D features, a $1\%$ drop in average mAP compared to the results obtained with cross-entropy. In contrast, the best result on THUMOS14 is obtained by learning only one scalar value. When learning two values for attention with cross-entropy, our model obtains only $19.95\%$ mAP at tIoU $0.5$ .
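The two attention parameterizations can be contrasted with a per-snippet loss sketch. This is an illustration under assumptions, not the authors' code: it takes the two-value variant to be a softmax cross-entropy over a (background, foreground) logit pair, and the single-scalar variant to be a binary cross-entropy on a sigmoid of one logit. The function names are hypothetical.

```python
import math


def two_value_attention_loss(bg_logit, fg_logit, is_foreground):
    """Softmax cross-entropy over the (background, foreground) logits."""
    m = max(bg_logit, fg_logit)  # subtract the max for numerical stability
    log_z = m + math.log(math.exp(bg_logit - m) + math.exp(fg_logit - m))
    target_logit = fg_logit if is_foreground else bg_logit
    return log_z - target_logit


def single_scalar_attention_loss(logit, is_foreground):
    """Logistic loss (binary cross-entropy on sigmoid(logit))."""
    # Flip the sign for background targets, then apply a stable softplus(-z).
    z = logit if is_foreground else -logit
    return math.log1p(math.exp(-abs(z))) + max(-z, 0.0)
```

For a single snippet, the two losses take the same value when the scalar logit equals `fg_logit - bg_logit`; the variants differ in how the network parameterizes the attention, with two outputs in one case and a single output in the other.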

B.1 Qualitative Results

ActivityNet v1.2. Figure 6 shows some RefineLoc qualitative detection results on ActivityNet. We present results across different refinement iterations. The top video shows our method not only enhances its coverage over iterations, but it is also able to detect a new instance at iteration $1$ that was missed in the previous iteration. In the middle video, we see how RefineLoc manages to successfully merge different predictions over iterations. We also see erroneous predictions being cut off from iteration to iteration. The final example shows a failure case. Despite the starting point at iteration $0$, our predictions diverge in later steps. We believe this confusion comes from the heavy context around the actions.

THUMOS14. Figure 7 showcases RefineLoc qualitative results from the THUMOS14 dataset. We present results for three different videos over multiple refinement iterations. The top video shows our method not only enhances its coverage over iterations, but it is also able to detect a new instance at iteration $1$ that was missed in the previous iteration. In the middle video, we see how RefineLoc manages to successfully cut off erroneous predictions from iteration to iteration. The final example shows a failure case. Despite starting with decent predictions at iteration $0$, our predictions do not improve in subsequent steps. We believe this confusion comes from the heavy context around the actions.

Bibliography


[1] Humam Alwassel, Fabian Caba Heilbron, Victor Escorcia, and Bernard Ghanem. Diagnosing error in temporal action detectors. In ECCV, 2018.
[2] Humam Alwassel, Fabian Caba Heilbron, and Bernard Ghanem. Action search: Spotting targets in videos and its application to temporal action localization. In ECCV, 2018.
[3] Hakan Bilen and Andrea Vedaldi. Weakly supervised deep detection networks. In CVPR, 2016.
[4] Piotr Bojanowski, Rémi Lajugie, Francis Bach, Ivan Laptev, Jean Ponce, Cordelia Schmid, and Josef Sivic. Weakly supervised action labeling in videos under ordering constraints. In ECCV, 2014.
[5] Shyamal Buch, Victor Escorcia, Chuanqi Shen, Bernard Ghanem, and Juan Carlos Niebles. SST: Single-stream temporal action proposals. In CVPR, 2017.
[6] Fabian Caba Heilbron, Juan Carlos Niebles, and Bernard Ghanem. Fast temporal activity proposals for efficient detection of human actions in untrimmed videos. In CVPR, 2016.
[7] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. ActivityNet: A large-scale video benchmark for human activity understanding. In CVPR, 2015.
[8] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018.