Spatio-Temporal Action Localization in a Weakly Supervised Setting

Kurt Degiorgio; Fabio Cuzzolin

arXiv:1905.02171·cs.CV·May 7, 2019


TL;DR

This paper introduces a weakly supervised method for spatio-temporal action localization in videos, using unsupervised segmentation, CNN feature extraction, and a novel MIL formulation with continuous labels to reduce annotation requirements.

Contribution

It proposes a new MIL approach with continuous labels and set splitting regularization for weakly supervised action detection in videos.

Findings

01 Effective on UCF-Sports dataset

02 Outperforms some existing weakly supervised methods

03 Demonstrates robustness with limited annotations


Tables

Table 1: Intersection Over Union (IOU) for UCF-Sports.

Action Class	ADT [30]	ATM [12]	HVC [10]
Diving-Side	0.583706	0.363168	0.452065
Golf-Swing-Back	0.801282	0.748542	0.436035
Golf-Swing-Front	0.435979	0.526266	0.533355
Golf-Swing-Side	0.724060	0.727153	0.461906
Kicking-Front	0.696297	0.550503	0.337375
Kicking-Side	0.691274	0.484475	0.498129
Riding-Horse	0.542712	0.247956	0.467046
Run-Side	0.588205	0.492336	0.351405
SkateBoarding-Front	0.659791	0.561727	0.319265
Swing-Bench	0.683738	0.464019	0.400841
Swing-SideAngle	0.529371	0.368790	0.359946
Walk-Front	0.713729	0.640201	0.425863
Mean IOU	0.637512	0.514595	0.420269
Mean #Proposals	1798	254	312

Table 2: mAP scores for UCF-Sports (mAP / AP @ 0.2).

CLASS	PMIL	PMIL+F	PMIL+F+S
kicking	64.00	64.00	64.00
golf-swing	79.46	79.46	79.46
diving	100.00	100.00	100.00
riding-horse	12.50	12.50	50.00
running	0.00	42.00	50.01
skate-boarding	0.00	43.01	50.08
swing-bench	100.00	100.00	100.00
swing-side	0.00	56.25	56.25
walking	26.84	40.16	40.16
mAP	43.00	60.00	67.00

Table 3: mSERO scores for UCF-Sports.

CLASS	PMIL	PMIL+F	PMIL+F+S
kicking	4.410	3.771	3.771
golf-swing	4.868	4.606	4.606
diving	0.429	1.216	1.216
riding-horse	3.323	3.456	1.523
running	10.152	10.180	0.659
skate-boarding	12.228	12.228	12.228
swing-bench	5.450	4.644	4.644
swing-side	11.082	9.456	9.456
walking	8.785	6.969	6.969

Table 4: UCF-Sports result comparison.

Class	Ours	[28]
kicking	64.00	_
golf-swing	79.46	_
diving	100.00	_
riding-horse	50.00	_
running	50.08	_
skate-boarding	50.00	_
swing-bench	100.00	_
swing-side	56.25	_
walking	40.16	_
mAP	67.00	61.20

Equations

$$\bigcup_{i}\mathcal{V}_{i}=\Lambda^{3}\quad\text{and}\quad\mathcal{V}_{i}\cap\mathcal{V}_{j}=\emptyset\quad\forall\, i\neq j.$$

$$\mathcal{D}_{a}=\big\{(X_{1},Y_{1,a}),\dots,(X_{N},Y_{N,a})\big\},$$

$$Y_{i,a}=\max_{j}\, y_{i,j,a}.$$

$$f_{a}(x_{i,j})\to y_{i,j,a}.$$

$$p_{i,j,a}=\Pr(y_{i,j,a}=1\mid x_{i,j};\mathbf{w}_{a})=\frac{1}{1+\exp(-\mathbf{w}_{a}^{T}x_{i,j})}.\tag{2}$$

$$P_{i,a}=\Pr(Y_{i,a}=1\mid X_{i};\mathbf{w}_{a})=1-\prod_{j=1}^{J}(1-p_{i,j,a}).\tag{3}$$

$$\arg\min_{\mathbf{w}_{a}}\left[\frac{\lambda}{2}\|\mathbf{w}_{a}\|^{2}+\frac{1}{JN}\sum_{i=1}^{N}\Big(J\beta\,\alpha_{X_{i}}+\sum_{j=1}^{J}\alpha_{x_{i,j}}\Big)\right],\tag{4}$$

$$\alpha_{X_{i}}=-\big[Y_{i,a}\log P_{i,a}+(1-Y_{i,a})\log(1-P_{i,a})\big]$$

$$\alpha_{x_{i,j}}=\max\big(0,\ \eta-\mathrm{sign}(p_{i,j,a}-\zeta)\,(\mathbf{w}_{a}^{T}x_{i,j})\big)$$

$$\mathrm{mSERO}_{a}=\frac{1}{N}\sum_{i=1}^{N}\Big(k_{i,a}-\max_{j}\, p_{i,j,a}\Big)^{2},$$


Taxonomy

Topics: Human Pose and Action Recognition · Human Motion and Animation · Video Analysis and Summarization

Full text


Kurt Degiorgio

Oxford Brookes University


Fabio Cuzzolin

Oxford Brookes University


Abstract

Enabling computational systems to localize actions in video-based content has manifold applications. Traditionally, such a problem is approached in a fully-supervised setting where video-clips with complete frame-by-frame annotations around the actions of interest are provided for training. However, the data requirements needed to achieve adequate generalization in this setting are prohibitive. In this work, we circumvent this issue by casting the problem in a weakly supervised setting, i.e., by considering videos as labelled ‘sets’ of unlabelled video segments. Firstly, we apply unsupervised segmentation to take advantage of the elementary structure of each video. Subsequently, a convolutional neural network is used to extract RGB features from the resulting video segments. Finally, Multiple Instance Learning (MIL) is employed to predict labels at the video segment level, thus inherently performing spatio-temporal action detection. In contrast to previous work, we make use of a different MIL formulation in which the label of each video segment is continuous rather than discrete, making the resulting optimization function tractable. Additionally, we utilize a set splitting technique for regularization. Experimental results considering multiple performance indicators on the UCF-Sports data-set support the effectiveness of our approach.

1 Introduction

Localising and classifying human actions in video-clips is a hard problem. This can be attributed to the sheer variety of scenarios under which such actions may be performed. Traditionally, a fully-supervised paradigm is used to tackle both the localization and classification of human actions. This is problematic, especially in a deep learning setting, as it is intractable to manually label spatio-temporal actions in millions of videos. Tools that automate the collection and annotation of data are frequently used to alleviate this problem [31, 3]. While effective, such tools still rely on human intervention, limiting the range of action classes that can be learned. Recently, researchers have formulated the aforementioned problem under a Weakly Supervised Learning (WSL) [22, 15, 16, 11] setting, where the extensive data annotation requirements of a fully supervised setting are significantly alleviated. Specifically, WSL only requires one label designating the presence of a given action per video, rather than one label for every video frame or segment. Under weakly labelled conditions, classification functions must be learned by correlating negative and positive video instances of a given action. The process is naturally more challenging than in a fully supervised setting, as the decision of which function to learn is more ambiguous. Should a learner recognize instances where a human is ‘kicking a ball’? Or should it detect ‘kicking’ in general, regardless of what is being kicked or who is doing the ‘kicking’? In WSL, the learner will make this choice according to what is present in the extracted features. More specifically, if a majority of the positive videos depict a human body kicking a ball, then this is what the model will learn. This can be problematic as there are no labels to guide the algorithm towards the desired function, the one mapping the ‘kicking’ action to the ‘kicking’ label. It may instead learn a function that maps the label ‘kicking’ to an artifact that is, by mere accident, prevalent in all the positive videos in the training set. Such issues notwithstanding, significant progress has been made.

The training algorithm used by the vast majority of previous related works is a formulation of Multiple Instance Learning (MIL). While effective, MIL leads to a mixed integer programming problem that has to be solved heuristically. Moreover, formulations under this setting make it difficult to integrate prior information.

1.1 Contribution

For these reasons, in this paper we adopt a different approach that allows the latent label of each proposal to be continuous instead of discrete. Additionally, we make use of a video-splitting technique aimed at reducing ambiguity between proposals, thus improving localization performance. With regard to proposal generation (a crucial step in MIL), we evaluate three different techniques that aim to generate video segments in the form of temporally-consistent action-tubes [9].

Subsequently, we take advantage of a standard Convolutional Neural Network (CNN) architecture to extract deep features from each action tube.

To the best of our knowledge, this holistic approach of using relaxed MIL constraints in conjunction with video splitting is a novel method for localizing spatio-temporal actions under weakly labelled conditions.

1.2 Paper outline

The remaining sections of this paper are structured as follows. Section 2 gives a concise overview of the relevant literature. Sections 3 and 4 formalise the proposed methodology and the experimental setup, respectively. Section 5 presents and discusses the empirical results observed on the UCF-Sports data-set, and Section 6 concludes the paper.

2 Related Work

Three distinct problems need to be tackled in order to successfully localize spatio-temporal actions in test videos: (i) feature extraction, (ii) video segment representation, and (iii) the design of algorithms that can learn to generalize over the observed features.

The majority of state-of-the-art methods in video-based action localisation employ CNNs along with a temporal association algorithm for localizing actions in a fully-supervised setting [9, 34, 13, 24, 20, 2, 21, 23]. Such approaches demand per-frame bounding box annotation, which is expensive, and are hence restricted to relatively small data-sets with a limited number of action classes. This has created a need for a framework that can leverage the descriptive power of CNNs without the expensive space-time annotations that are demanded by fully-supervised learning. Relevant research is being conducted for object detection in images [25, 26, 36, 6, 4, 18], but the topic is still surprisingly unexplored in the video domain.

Traditionally, Bag-of-features (BOF) encoding has been the representation technique of choice. This method clusters space-time features to build a visual vocabulary from the entire training set of videos. Features are commonly extracted based on shape (HOG, etc.) or motion (e.g., optical flow) [17]. Sapienza et al. [22] have shown that this approach is inherently flawed, since the resulting feature set is not sufficiently descriptive of the action class in question. In an attempt to address this issue, [22] divides each video into a number of sub-volumes determined by a rigid spatio-temporal grid. BOF or Fisher vectors are employed to describe each sub-volume by extracting and grouping features through a technique developed by [32], termed Dense Trajectory Features. The latter is a significant improvement over previous, and now largely outdated, approaches based on interest point detectors (e.g. 3D-SIFT, HOG3D, SURF, and so on). After associating a video with a ‘bag’ of video sub-volumes (a step termed proposal generation), [22] casts the localization problem in a weakly supervised learning framework, to select sub-volumes that best characterize the action class under consideration. Subdividing the video clip into sub-volumes and learning solely from them allows the authors to mitigate the issue with extraneous features. While effective, this approach is still not optimal as the grid is rigid, and cannot be made too fine because of complexity issues. For this reason, pixel-wise video segmentation, such as the one utilized by [28], has the potential to be more effective. Furthermore, deep learning has shown that learning representations automatically through the use of Deep Neural Networks can outclass all (‘handcrafted’) representation techniques based on manual feature engineering [8, 9].

Following this intuition, Tang et al. [28] annotate actions using a pixel mask by exploiting a video segmentation method conceptualised by [10] and [7]. Inter-class information is leveraged to train a WSL algorithm that is capable of localizing actions. [27] adopts a markedly distinctive approach. Specifically, a large number of spatio-temporal action proposals are generated using dense trajectory features. This set is significantly trimmed by exploiting motion and saliency-based information. Subsequently, a graph is constructed between proposals across different videos and actions are localized by finding maximum cliques in the resulting graph.

The approach proposed in this paper is most similar to that of [28]. The novelty element in our work is the fact that we use a probabilistic MIL formulation to generalize over new features, in conjunction with a video splitting technique applied at training time, that allows us to reduce ambiguity and ameliorate generalization performance.

3 Methodology

The proposed action localization methodology is designed to be hierarchical. Specifically, we first take advantage of one of three different techniques to generate temporally consistent video segment proposals, in the form of action tubes. Next, a CNN is used to extract deep features from each action tube. For the sake of clarity and generality, once an action tube is vectorised by the CNN, it is referred to as a ‘proposal’. Similarly, the originating video is called a ‘set’ (sometimes the term ‘bag’ is used in the WSL literature). Generally, each single ‘set’ is composed of several proposals.

Given ‘sets’ of proposals derived from the training videos, MIL takes care of selecting (i.e., localizing) for each test set a proposal that best characterizes the action the model represents. This is done by training a probabilistic model for every action class under a probabilistic MIL formulation, thus taking full advantage of the weakly supervised learning setting.

At prediction time the learned models are employed to localize proposals within the test sets (video-clips), after which one-vs-all classification, using all the learnt class-specific models, is used to infer the set-level label. This label essentially associates a set (i.e., a video-clip) with a specific action class.

Figure 1 provides an overview of this approach.

3.1 Generating Action Tube Proposals

The generation of action tubes from videos can be formally defined by a mapping $F:V\rightarrow\Lambda^{3}$ , where $V$ represents a video and $\Lambda^{3}$ is a 3D lattice associated with that video. Conversely, a supervoxel is a subset $\mathcal{V}$ of $\Lambda^{3}$ representing a group of connected and/or perceptually similar pixels.

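Given an input video $V$ , video segmentation produces a set of supervoxels $\{\mathcal{V}_{i}\subset\Lambda^{3}\}_{i}$ such that:

$$\bigcup_{i}\mathcal{V}_{i}=\Lambda^{3}\quad\text{and}\quad\mathcal{V}_{i}\cap\mathcal{V}_{j}=\emptyset\quad\forall\, i\neq j.$$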
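As a small illustration of this partition property, the sketch below checks that a set of supervoxel masks is pairwise disjoint and jointly covers the lattice. It assumes, purely for illustration, that a segmentation is stored as an integer label volume of shape (frames, height, width); the function names and the toy data are hypothetical and are not part of any of the cited segmentation methods.

```python
import numpy as np

def supervoxel_masks(labels):
    """Return one boolean mask per supervoxel id found in an integer label volume."""
    return {int(k): labels == k for k in np.unique(labels)}

def is_partition(masks, shape):
    """True if the masks are pairwise disjoint and jointly cover the whole lattice."""
    coverage = np.zeros(shape, dtype=int)
    for mask in masks.values():
        coverage += mask.astype(int)
    # Every voxel must belong to exactly one supervoxel.
    return bool(np.all(coverage == 1))

# Toy lattice (hypothetical data): 2 frames of 2x3 pixels, three supervoxel ids.
labels = np.array([[[0, 0, 1], [2, 2, 1]],
                   [[0, 1, 1], [2, 2, 2]]])
masks = supervoxel_masks(labels)
print(is_partition(masks, labels.shape))
```

A label volume satisfies the property by construction, since each voxel carries a single id; the check becomes informative when masks come from a method that may produce overlapping or incomplete proposals.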

In particular, the video segmentation algorithm proposed by [10] first builds a 3D graph from the entire video volume by using image segmentation methods to get an initial over-segmented version of each frame. Subsequently, dense optical flow is used to slice the structure of the graph along the temporal dimension. The authors make use of a hierarchical scheme to recursively re-segment the over-segmented frames. This approach enables the algorithm to achieve spatially and temporally cohesive supervoxels even in videos of longer duration. While effective, this algorithm has drawbacks. Firstly, the algorithm does not provide a method for selecting an optimal hierarchy level (i.e., when should we stop segmenting?). Secondly, the supervoxels that represent human actions still tend to be over-segmented even at the higher levels of the hierarchy.

To address the first issue, [37] develops a method that makes use of an ‘objectiveness’ measure to select the hierarchy level that yields the best spatially and temporally consistent supervoxels. The work of [35] further improves on this idea by developing an algorithm that joins broken supervoxels using a selective search approach.

In this work we evaluate two spatio-temporal video segmentation methods for proposal generation: ‘Efficient hierarchical graph-based video segmentation (HVC)’ [10] and ‘Action localization with tubelets from motion (ATM)’ [12]. Additionally, we evaluate a recent non-segmentation-based proposal generation method, namely ‘Action localization proposals from dense trajectories (ADT)’ [30].

3.2 Proposal Representation

The CNN from [5] is used in this work to map each proposal to a feature vector. This particular CNN has 16 layers from which fc7 features are extracted for every frame in every action tube. The resulting feature vectors are averaged across all the frames of the constituting action tube and then normalized. This process results in $4096$ feature components per action tube. One side effect of our approach is that since frames are fed to the network separately, no motion-related information is extracted. Neural networks that can learn from an entire video sequence are becoming very popular [29]: their adoption would provide a natural continuation of this work.
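The pooling step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the per-frame fc7 activations are replaced by random stand-in values, the function name is hypothetical, and L2 normalisation is an assumption, since the text does not state which norm is used.

```python
import numpy as np

def tube_descriptor(frame_features, eps=1e-12):
    """Average per-frame features over one action tube, then L2-normalise.

    frame_features: array of shape (num_frames, 4096).
    The choice of the L2 norm is an assumption made for this sketch.
    """
    frame_features = np.asarray(frame_features, dtype=float)
    mean = frame_features.mean(axis=0)
    return mean / (np.linalg.norm(mean) + eps)

# Stand-in for the fc7 activations of a 12-frame action tube (random values).
rng = np.random.default_rng(0)
fc7 = rng.random((12, 4096))
x = tube_descriptor(fc7)
print(x.shape)
```

Because the frames are averaged, the descriptor is invariant to frame order, which is consistent with the remark that no motion-related information is captured.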

3.3 Weakly Supervised Learning - Training

Consider a finite set of actions $a\in\mathcal{A}$ , each representing a specific action class (e.g., ‘kicking’). For each action class $a$ we define:

$$\mathcal{D}_{a}=\big\{(X_{1},Y_{1,a}),\dots,(X_{N},Y_{N,a})\big\},$$

where $X_{i}$ , $i\in\{1,\dots,N\}$ , represents a single ‘set’ (i.e., a single video) and $Y_{i,a}\in\{0,1\}$ is the associated binary set-level label (either the selected action class is present, or it is not). Every set $X_{i}$ is composed of $J$ proposals (namely, vectorised action tubes) $x_{i,j}\in X_{i}$ , where $j\in\{1,\dots,J\}$ . Initially, $Y_{i,a}$ is set to $1$ if $X_{i}$ contains a proposal that depicts $a$ , and to $0$ otherwise.

Each $x_{i,j}$ is associated with a latent variable $y_{i,j,a}$ , which represents the (unknown) label of the corresponding proposal (video segment). Under the classical MIL assumptions:

$$Y_{i,a}=\max_{j}\, y_{i,j,a}.$$

The objective is thus to find, for each action class $a$ , a function providing the value of the latent variable for each proposal, namely:

$$f_{a}(x_{i,j})\to y_{i,j,a}.$$

[1] formulates the MIL objective function as an instance max-margin problem. This leads to a mixed integer programming problem that can only be solved heuristically.

Additionally, using this formulation prior knowledge cannot be easily incorporated.

For these reasons, here we follow [33] in order to relax the MIL constraints such that the latent variable of each proposal is allowed to assume continuous values rather than discrete ones, namely: $y_{i,j,a}\in[0,1]$ .

The probability, $p_{i,j,a}$ that a specific proposal (action tube) in a set (video-clip) belongs to a given action class is modelled by a logistic function as follows:

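$$p_{i,j,a}=\Pr(y_{i,j,a}=1\mid x_{i,j};\mathbf{w}_{a})=\frac{1}{1+\exp(-\mathbf{w}_{a}^{T}x_{i,j})}.\tag{2}$$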

The probability that set $X_{i}$ contains at least one proposal depicting $a$ is therefore:

$$P_{i,a}=\Pr(Y_{i,a}=1\mid X_{i};\mathbf{w}_{a})=1-\prod_{j=1}^{J}(1-p_{i,j,a}).\tag{3}$$
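Equations (2) and (3) can be illustrated numerically with the short sketch below. The function names and the toy feature vectors are hypothetical; only the logistic and product forms are taken from the text.

```python
import numpy as np

def proposal_probs(X, w):
    """Equation (2): logistic probability for each proposal (one row of X each)."""
    return 1.0 / (1.0 + np.exp(-(X @ w)))

def set_prob(p):
    """Equation (3): probability that at least one proposal in the set is positive."""
    return 1.0 - np.prod(1.0 - p)

# Toy set with three 2-D proposals and a toy weight vector (illustrative values).
X = np.array([[2.0, 0.0], [0.0, 0.0], [-2.0, 0.0]])
w = np.array([1.0, 0.0])
p = proposal_probs(X, w)
P = set_prob(p)
print(p, P)
```

Two properties follow directly from the product form: the set probability is never smaller than the largest proposal probability, and it is zero only when every proposal probability is zero.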

Equation (3) encapsulates the MIL constraints: a set, $X_{i}$ , is positive iff at least one proposal, $x_{i,j}$ , is positive, and negative iff all of its proposals are negative. The overall optimisation problem is:

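$$\arg\min_{\mathbf{w}_{a}}\left[\frac{\lambda}{2}\|\mathbf{w}_{a}\|^{2}+\frac{1}{JN}\sum_{i=1}^{N}\Big(J\beta\,\alpha_{X_{i}}+\sum_{j=1}^{J}\alpha_{x_{i,j}}\Big)\right],\tag{4}$$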

where

$$\alpha_{X_{i}}=-\big[Y_{i,a}\log P_{i,a}+(1-Y_{i,a})\log(1-P_{i,a})\big]$$

is the cost function for $X_{i}$ , and

$$\alpha_{x_{i,j}}=\max\big(0,\ \eta-\mathrm{sign}(p_{i,j,a}-\zeta)\,(\mathbf{w}_{a}^{T}x_{i,j})\big)$$

is the cost function for $x_{i,j}$ .

The following parameters are involved:

• $\lambda$ weights the squared-norm penalty on $\mathbf{w}_{a}$ in equation (4), i.e., it controls weight decay.

• $\beta$ weights the set-level cost $\alpha_{X_{i}}$ against the proposal-level costs $\alpha_{x_{i,j}}$.

• $\eta$ is the margin parameter that separates positive proposals from negative proposals.

• $\zeta$ is a threshold parameter that determines whether a proposal should be considered as positive or negative.

Stochastic gradient descent is used to solve equation (4); this requires an additional hyper-parameter $\pi$ that controls the number of iterations per set.
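For concreteness, the objective in equation (4) can be evaluated as in the sketch below. It only computes the value being minimised, not the gradient step, and it is not the authors' training code: the function and argument names are hypothetical, every set is assumed to contain the same number $J$ of proposals (as the equation implies), and the clipping of $P_{i,a}$ is a numerical safeguard added here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relaxed_mil_objective(w, sets, labels, lam, beta, eta, zeta, eps=1e-12):
    """Value of the objective in equation (4) for one action class.

    sets:   list of N arrays of shape (J, d), one array per video.
    labels: list of N set-level labels in {0, 1}.
    """
    N = len(sets)
    J = sets[0].shape[0]
    total = 0.0
    for X, Y in zip(sets, labels):
        scores = X @ w
        p = sigmoid(scores)                          # equation (2)
        P = 1.0 - np.prod(1.0 - p)                   # equation (3)
        P = np.clip(P, eps, 1.0 - eps)               # avoid log(0); added safeguard
        set_cost = -(Y * np.log(P) + (1 - Y) * np.log(1.0 - P))
        prop_cost = np.maximum(0.0, eta - np.sign(p - zeta) * scores)
        total += J * beta * set_cost + prop_cost.sum()
    return 0.5 * lam * (w @ w) + total / (J * N)

# Toy data: one positive and one negative set, two 2-D proposals each.
rng = np.random.default_rng(0)
sets = [rng.random((2, 2)), rng.random((2, 2))]
value = relaxed_mil_objective(np.zeros(2), sets, [1, 0],
                              lam=0.1, beta=1.0, eta=1.0, zeta=0.5)
print(value)
```

With $\mathbf{w}_{a}=0$ every proposal probability equals $0.5$, so the value can be checked by hand from the three equations above.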

Algorithm 1 and Algorithm 2 showcase a pseudo-code implementation of the training function. Note, $\omega$ is exclusively used for set-splitting (discussed below). $\%$ denotes the modulus operator.

3.3.1 Set Splitting

MIL captures factors that explain the statistical variations between negative and positive sets. In essence, it finds factors of variation that are sufficiently significant to explain the observed difference. Naturally, this makes it difficult to bias MIL formulations towards learning functions that represent only the class of interest and nothing more. Consequently, MIL tends to perform poorly on so-called ‘hard positives’ or ‘hard negatives’. These are noisy proposals for which it is hard to determine whether the proposal in question represents the action of interest or not. Geometrically, such proposals lie precariously close to the decision boundaries that separate positives from negatives. An example is differentiating between the ‘running’ and ‘walking’ actions.

To mitigate the aforementioned issue, we adopt a set splitting technique that leverages the probabilistic formulation of MIL to repeatedly split each set into negatives and positives during training. Specifically, on every epoch, the proposals $x_{i,j}$ in $X_{i}$ are sorted according to $p_{i,j,a}$ , in descending order. Then the top- $\omega$ proposals $x_{i,j}$ are considered to be positive instances of class $a$ , while the rest are deemed to be negative. A new set $X_{i+1}$ is then created to host them (see Algorithm 3).
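The ranking step of this procedure can be sketched as follows. The sketch covers only the sort and the top-$\omega$ cut for a single set and a single class; how the resulting subsets are labelled and re-inserted into the training data is specified by Algorithm 3, which is not reproduced here. The function name and toy values are hypothetical.

```python
import numpy as np

def split_set(X, w, omega):
    """Split one set into its top-omega proposals and the remaining proposals.

    Proposals are ranked by their probability under the current model w,
    in descending order, as described in the text.
    """
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    order = np.argsort(-p, kind="stable")   # indices sorted by descending probability
    top, rest = order[:omega], order[omega:]
    return X[top], X[rest]

# Toy set of four 1-D proposals (illustrative values only).
X = np.array([[0.0], [3.0], [-1.0], [1.0]])
w = np.array([1.0])
positives, negatives = split_set(X, w, omega=2)
print(positives.ravel(), negatives.ravel())
```

Because the split depends on the current weights, it changes from epoch to epoch as the model is updated.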

3.4 Weakly Supervised Learning - Predictor

Given a model $f(X_{i};\mathbf{w}_{a};{b}_{a})$ representing a single action class $a$ , proposal probabilities are given by equation (2). Each probability, $p_{i,j,a}$ , is interpreted as a measure of how likely it is for a given proposal, $x_{i,j}$ , to depict action class $a$ . This answers the localisation problem, where the proposal with the highest probability is considered to be the most likely proposal to depict $a$ in $X_{i}$ .

The probability that set $X_{i}$ contains a proposal that depicts $a$ is derived from proposal probabilities by equation (3). Set probabilities are used to resolve the classification problem where we would like to associate each set (video-clip) with a single action-class. For this purpose, one-vs-all classification is used.

Algorithm 4 outlines the prediction function. This predicts both set and proposal probabilities for every new observation, with every trained action class model.
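A minimal sketch of this prediction step, under stated assumptions, is given below. It is not a transcription of Algorithm 4: the dictionary of per-class weight vectors and the function name are hypothetical, and the bias term $b_{a}$ is omitted because equation (2) does not include it.

```python
import numpy as np

def predict(X, models):
    """One-vs-all prediction for a single set X of shape (J, d).

    models maps a class name to its weight vector (hypothetical container).
    Returns the predicted class, the index of the localised proposal for
    that class, and the set probability of every class.
    """
    set_probs, best_idx = {}, {}
    for name, w in models.items():
        p = 1.0 / (1.0 + np.exp(-(X @ w)))          # equation (2)
        set_probs[name] = 1.0 - np.prod(1.0 - p)    # equation (3)
        best_idx[name] = int(np.argmax(p))          # most likely proposal
    label = max(set_probs, key=set_probs.get)
    return label, best_idx[label], set_probs

# Toy set with two proposals and two toy class models (illustrative values).
X = np.array([[4.0, 0.0], [0.0, 0.0]])
models = {"kicking": np.array([1.0, 0.0]), "diving": np.array([-1.0, 0.0])}
label, idx, set_probs = predict(X, models)
print(label, idx)
```

The set-level label comes from comparing set probabilities across classes, while the localisation comes from the proposal probabilities of the winning class only.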

4 Experimental Setup

4.1 Benchmark and Protocols

The UCF-Sports data-set is used to evaluate our approach. It consists of one hundred and fifty videos divided into ten separate action classes. Although this data-set comes with both set (video) and proposal (frame) level annotations, our approach does not require proposal-level annotations, and as such they are only used for evaluation purposes. With regard to the train-test split, we follow the approach of [14], where forty-three videos are used for testing and one hundred and three videos for training. For cross-validation, a slightly modified version of the leave-one-out method is used: instead of leaving only one observation out, we leave one out for every class while still guaranteeing that every class is in the training set at least once. We make use of a standard grid-search for selecting hyper-parameter values, optimizing only for set-level classification accuracy.

As for proposal generation, we evaluate and compare three algorithms (ATM, HVC, ADT), from which we select one for the subsequent stages.

4.2 Experiments

After extracting features using the selected proposal generation algorithm we train a probabilistic model for every action class, as described above. We set up three different experiments:

•

PMIL: baseline probabilistic MIL;

•

PMIL+F: probabilistic MIL with filtering, where exceedingly large proposals are filtered-out;

•

PMIL+F+S: probabilistic MIL with filtering and splitting.

For each experiment we train a new set of models and evaluate and compare both set-level classification and localization performance. In general, an optimal model should pick a proposal that perfectly encapsulates the action it represents while simultaneously providing accurate set-level classification performance.

4.2.1 Performance Indicators

We employ the Intersection Over Union (IOU) metric to evaluate the quality of the generated proposals with respect to the ground-truth. To evaluate set-level classification and localization performance we follow the lead of previous work [22, 28] by using Mean Average Precision (mAP). For the purposes of calculating these scores, as in previous work we consider any proposal that covers at least 20% of the ground truth to be correct.
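As an illustration of the IOU criterion, the sketch below computes a volumetric IOU between two action tubes and applies the 20% threshold. Representing tubes as boolean masks of shape (frames, height, width) is an assumption made for this example; the text does not specify how tubes and ground-truth annotations are rasterised, and the function names are hypothetical.

```python
import numpy as np

def tube_iou(pred_mask, gt_mask):
    """Volumetric IOU between two boolean masks of shape (T, H, W)."""
    pred_mask = np.asarray(pred_mask, dtype=bool)
    gt_mask = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred_mask, gt_mask).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(pred_mask, gt_mask).sum() / union)

def is_correct(pred_mask, gt_mask, threshold=0.2):
    """A proposal counts as correct when its IOU reaches the threshold."""
    return tube_iou(pred_mask, gt_mask) >= threshold

# Toy tubes over 2 frames of 4x4 pixels, shifted by one column.
gt = np.zeros((2, 4, 4), dtype=bool)
pred = np.zeros((2, 4, 4), dtype=bool)
gt[:, :2, :2] = True
pred[:, :2, 1:3] = True
print(tube_iou(pred, gt), is_correct(pred, gt))
```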

In contrast to other work, we measure the localization performance without considering classification accuracy. This metric is referred to as Mean-Squared Error with Respect to Optimal Choice or in short, mSERO.

This score is given, for each action class $a$ by:

$$\mathrm{mSERO}_{a}=\frac{1}{N}\sum_{i=1}^{N}\Big(k_{i,a}-\max_{j}\, p_{i,j,a}\Big)^{2},$$

where $k_{i,a}$ is the probability assigned to the optimal choice (best possible proposal as measured by IOU). According to this metric an optimal model will always pick the proposal with the best IOU, whereas a sub-optimal model will not. This measure quantifies how far off a model was from predicting the best possible proposal as generated by the first stage of our pipeline.
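The mSERO definition can be read as the following computation for one action class. The sketch takes $k_{i,a}$ to be the probability the model assigns to the proposal with the highest IOU, as stated above; the function name, the input layout and the toy numbers are hypothetical.

```python
import numpy as np

def msero(probs, ious):
    """mSERO for one action class.

    probs: list with one 1-D array of proposal probabilities per test set.
    ious:  list with the matching 1-D arrays of proposal IOUs.
    """
    errors = []
    for p, iou in zip(probs, ious):
        p = np.asarray(p, dtype=float)
        iou = np.asarray(iou, dtype=float)
        k = p[np.argmax(iou)]            # probability given to the best-IOU proposal
        errors.append((k - p.max()) ** 2)
    return float(np.mean(errors))

# Two toy test sets: the first model choice is optimal, the second is not.
probs = [[0.9, 0.2], [0.3, 0.8]]
ious = [[0.7, 0.1], [0.6, 0.2]]
score = msero(probs, ious)
print(score)
```

In the first set the highest-probability proposal is also the best-IOU one, so it contributes zero; in the second set the model prefers the worse proposal and contributes $(0.3-0.8)^{2}$.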

5 Results

5.1 Proposal Generation

Table 1 reports the IOU score of the best action tube for every action class, averaged across all the videos in that class. Figure 4 depicts the Recall-IOU curve at various thresholds. Figure 3 illustrates a single frame from the ‘horse-riding’ action-class in UCF-Sports rendered with the generated action tube proposals alongside the ground-truth.

The results clearly show that the ADT algorithm achieves an overall better IOU than the other algorithms. However, as can be observed in Figure 3, it also generates significantly more proposals. In fact, ADT generates approximately seven proposals for each proposal generated by ATM or HVC. An excessive number of proposals per video is not desirable, since it makes the localization task harder, negatively affecting generalization performance. Primarily for this reason we decided to use the proposals generated by the ATM algorithm for feature extraction. ATM provides a reasonable balance between the number of proposals generated and the quality of the resulting video segments.

5.2 Classification and Localization

Table 2 lists the mAP scores for all action classes generated in the aforementioned experiments.

One can observe that filtering the data to exclude proposals that have high intersection with the entire video volume delivers significant improvements (PMIL+F). The reason can be attributed to the fact that large proposals tend to be ‘hard negatives’, implying that the learner does not have an incentive to pick a tight-fitting proposal over a larger proposal, assuming both proposals adequately cover the action in question. It can also be seen that the video splitting technique has a positive effect (PMIL+F+S). Both observations empirically validate our conjecture that MIL-based techniques generalize well, as long as proposal ambiguity is minimized.

Figure 5 plots the mAP score versus the IOU threshold. As expected, increasing the threshold parameter causes the mAP score to decrease significantly. Recall that when the threshold is increased the scoring algorithm is forced to be more selective as to what proposals it considers correct, thereby increasing the classification error.

5.2.1 mSERO

Table 3 lists the mSERO scores for all action classes in the considered experiments. Lower values imply that a proposal that is close to the optimal one has been selected. On the contrary, higher values indicate that the model selected a bad proposal. Recall that an optimal proposal is one that has high IOU relative to the ground truth. The best performing class-specific models are those for ‘diving’ followed closely by ‘kicking’ and ‘swinging-bench’. The effect of set-splitting (PMIL+F+S) on localization performance is significant.

Figure 6 examines the localization performance of two different models. Each figure plots all the proposals from a single video, highlighting the exact probability assigned by our model versus the actual IOU with respect to the ground truth. The proposal actually selected by the model (orange circle) as well as the optimal proposal (green circle) are highlighted. The red section depicts the minimum amount of IOU (20%) a selected proposal has to exceed in order for it to be considered ‘correct’. From these graphs one can easily see how the best performing model (i.e., that for ‘diving’) assigns a high probability to the proposal closer to the optimum (in terms of IOU). On the other hand, a sub-optimal model (such as that for ‘skateboarding’) assigns a high probability to a proposal with a low IOU.

5.2.2 Qualitative Discussion

Table 4 compares the results achieved by our methodology with those of [28]. The results clearly demonstrate the effectiveness of our approach (note that [28] does not break down scores on a per-class basis). Figure 7 illustrates successfully and unsuccessfully localised action instances of the UCF-Sports data-set. Interestingly, the frame showcasing the ‘walking’ action depicts three human bodies who appear to be walking. However, the provided ground truth annotation only covers one of these instances of the walking action. Our model picked the proposal covering the most salient body. Objectively, this is the correct answer. Similarly, in the ‘diving’ frame one can argue that the prediction given by our model is even better than the one provided by the ground-truth. Other similar examples can be provided.

6 Conclusion

In this work we proposed a novel framework for the localization and classification of space-time actions under weakly labelled conditions. Experimental results on the UCF-Sports data-set support the effectiveness of our approach. In summary, we first generate temporally consistent proposals. Then a CNN is used to extract RGB features from each proposal. Finally, we use a set-splitting technique to reduce ambiguity between different proposals and employ a probabilistic formulation of MIL. The latter serves to make the resulting optimization function more tractable while also enabling the integration of prior knowledge. While we leave such integration for future work, the present formulation can be seen as a stepping stone in that direction.

Acknowledgements

This work was in part supported by the Malta ENDEAVOUR scholarship scheme. The work was also partly supported by the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 779813 (SARAS).

Bibliography

[1] Stuart Andrews, Thomas Hofmann, and Ioannis Tsochantaridis. Multiple instance learning with generalized support vector machines. 2002.
[2] Harkirat S Behl, Michael Sapienza, Gurkirt Singh, Suman Saha, Fabio Cuzzolin, and Philip HS Torr. Incremental tube construction for human action detection. arXiv preprint arXiv:1704.01358, 2017.
[3] Simone Bianco, Gianluigi Ciocca, Paolo Napoletano, and Raimondo Schettini. An interactive tool for manual, semi-automatic and automatic video annotation. Computer Vision and Image Understanding, 131:88–99, 2015.
[4] Hakan Bilen, Marco Pedersoli, and Tinne Tuytelaars. Weakly supervised object detection with posterior regularization. In British Machine Vision Conference, 2014.
[5] Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Return of the devil in the details: Delving deep into convolutional nets. arXiv preprint arXiv:1405.3531, 2014.
[6] Ramazan Gokberk Cinbis, Jakob Verbeek, and Cordelia Schmid. Weakly supervised object localization with multi-fold multiple instance learning. arXiv preprint arXiv:1503.00949, 2015.
[7] Pedro F Felzenszwalb, Ross B Girshick, David McAllester, and Deva Ramanan. Object detection with discriminatively trained part-based models. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 32(9):1627–1645, 2010.
[8] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.