Query-adaptive Video Summarization via Quality-aware Relevance Estimation

Arun Balajee Vasudevan; Michael Gygli; Anna Volokitin; Luc Van Gool

arXiv:1705.00581·cs.CV·September 29, 2017


TL;DR

This paper introduces a query-adaptive video summarization method that leverages a neural network-based semantic embedding to select relevant, diverse, and high-quality frames, outperforming existing approaches.

Contribution

The authors propose a novel framework for query-relevant video summarization using semantic embeddings and introduce a new dataset for evaluation.

Findings

01 Outperforms state-of-the-art relevance prediction methods

02 Achieves better diversity and relevance in summaries

03 Demonstrates effectiveness on a newly created annotated dataset

Abstract

Although the problem of automatic video summarization has recently received a lot of attention, the problem of creating a video summary that also highlights elements relevant to a search query has been less studied. We address this problem by posing query-relevant summarization as a video frame subset selection problem, which lets us optimise for summaries which are simultaneously diverse, representative of the entire video, and relevant to a text query. We quantify relevance by measuring the distance between frames and queries in a common textual-visual semantic embedding space induced by a neural network. In addition, we extend the model to capture query-independent properties, such as frame quality. We compare our method against previous state of the art on textual-visual embeddings for thumbnail selection and show that our model outperforms them on relevance prediction. Furthermore,…


Tables

Table 1. Comparison of different model configurations trained on a subset of the Clickture dataset and fine-tuned on our Video Thumbnail dataset (RAD). We report the HIT@1 (fraction of times we select a “Very Good” or “Good” thumbnail), the Spearman correlation of our model predictions with the true candidate thumbnail scores, and mean average precision. The Huber + LSTM + $Q_{expli}$ model performs best. The Cost, LSTM and Quality columns are settings; the remaining columns are metrics.

Method	Cost	LSTM	Quality	HIT@1 (VG or G)	Spearman $\rho$	mAP
Random	-	-	-	57.17 $\pm$ 1.5	-	0.5780
Loss of Liu et al.	$l_{1}$	$\times$	$\times$	68.75	0.186	0.6308
Ours: L1	$l_{1}$	$\times$	$\times$	68.09	0.209	0.6348
Ours: Huber	$l_{huber}$	$\times$	$\times$	68.35	0.279	0.6446
Loss of Liu et al. + LSTM	$l_{1}$	$\checkmark$	$\times$	70.62	0.270	0.6507
Ours: Huber + LSTM	$l_{huber}$	$\checkmark$	$\times$	72.63	0.367	0.6685
Ours: Frame quality only $Q_{expli}$	$l_{huber}$	$\times$	$\checkmark$	65.95	0.236	0.6315
Ours: Huber + LSTM + $Q_{impli}$	$l_{huber}$	$\checkmark$	$\checkmark$	70.76	0.371	0.6657
Ours: Huber + LSTM + $Q_{expli}$	$l_{huber}$	$\checkmark$	$\checkmark$	74.76	0.376	0.6712

Table 2. Comparison of thumbnail selection performance against the state of the art, on the QTS evaluation dataset. Note that (Liu et al., 2015) uses queries for their method which are not publicly available (see text).

Textual input	Method	HIT@1 (VG)	HIT@1 (VG or G)	Spearman $\rho$	mAP
Queries	Liu et al. (Liu et al., 2015)	40.625	73.83	0.122	0.629
Titles	QAR without $Q_{expli}$	36.71	72.63	0.367	0.6685
Titles	QAR (Ours)	38.86	74.76	0.376	0.6712

Table 3. Performance of our relevance models on the RAD dataset in comparison with previous methods.

Textual input	Method	HIT@1	Spearman $\rho$	mAP
None	Random	66.6 $\pm$ 3.5	0.0	0.674
None	Video2GIF (Gygli et al., 2016)	67.0	0.167	0.708
None	Ours: Frame quality $Q_{expli}$	69.0	0.135	0.749
Titles	Liu et al. (Liu et al., 2015) + LSTM	70.0	0.134	0.731
Titles	QAR without $Q_{expli}$	70.0	0.182	0.743
Titles	QAR (Ours)	71.0	0.221	0.760
Queries	Liu et al. (Liu et al., 2015) + LSTM	72.0	0.204	0.730
Queries	QAR without $Q_{expli}$	76.0	0.268	0.752
Queries	QAR (Ours)	72.0	0.264	0.769

Table 4. Performance of summarization methods on the RAD dataset. Repr means Representativeness. $\checkmark$ and $-$ indicate whether an objective was used or not. MMR and ours learn their corresponding weights. Percentages in parentheses are the normalized learnt weights. Upper bound refers to the best possible performance, obtained using the ground truth annotations of RAD.

Similarity	Diversity	Quality	Repr	$<PR>$	$<CR>$	$<F1>$
$-$	$-$	$-$	$\checkmark$	0.654	0.817	0.672
$-$	$-$	$\checkmark$	$-$	0.671	0.542	0.522
$-$	$\checkmark$	$-$	$-$	0.575	0.808	0.629
$\checkmark$	$-$	$-$	$-$	0.763	0.550	0.578
$\checkmark$	$-$	$\checkmark$	$-$	0.775	0.563	0.594
MMR (Carbonell and Goldstein, 1998)
$\checkmark$ (33%)	$\checkmark$ (66%)	$-$	$-$	0.692	0.825	0.716
Hecate (Song et al., 2016)				0.708	0.787	0.713
Ours
$\checkmark$ (45%)	$\checkmark$ (43%)	$\checkmark$ (2%)	$\checkmark$ (10%)	0.704	0.825	0.721
Upper bound				0.938	0.925	0.928

Equations

$$s(\mathbf{t}, \mathbf{v}) = \frac{\mathbf{t} \cdot \mathbf{v}}{\|\mathbf{t}\| \, \|\mathbf{v}\|}$$

$$r(t, v) = s(\mathbf{t}, \mathbf{v}) + q_{v}$$

$$r(t, v^{+}) > r(t, v^{-})$$

$$s(\mathbf{t}, \mathbf{v}^{+}) > s(\mathbf{t}, \mathbf{v}^{-}), \qquad q_{v^{+}} > q_{v^{-}}$$

$$loss(t, v^{+}, v^{-}) = l_{p}\big(\max(0, \gamma - s(\mathbf{t}, \mathbf{v}^{+}) + s(\mathbf{t}, \mathbf{v}^{-}))\big) + l_{p}\big(\max(0, \gamma - q_{v^{+}} + q_{v^{-}})\big)$$

$$\mathbf{y}^{*} = \arg\max_{\mathbf{y} \in \mathcal{Y_{V}}} \mathbf{w}^{T} \mathbf{f}(\mathbf{x_{\mathcal{V}}}, \mathbf{y})$$

$$NMI(C, C^{\prime}) = \frac{2 \cdot I(C, C^{\prime})}{H(C) + H(C^{\prime})}$$


Code & Models

Repositories

arunbalajeev/query-video-summary (official)


Full text

Arun Balajee Vasudevan* (ETH Zurich), Michael Gygli* (ETH Zurich, Gifs.com), Anna Volokitin (ETH Zurich), Luc Van Gool (ETH Zurich, KU Leuven)

* Authors contributed equally.

arunv,gygli,anna.volokitin,[email protected]

(2017)

Abstract.

Although the problem of automatic video summarization has recently received a lot of attention, the problem of creating a video summary that also highlights elements relevant to a search query has been less studied. We address this problem by posing query-relevant summarization as a video frame subset selection problem, which lets us optimise for summaries which are simultaneously diverse, representative of the entire video, and relevant to a text query. We quantify relevance by measuring the distance between frames and queries in a common textual-visual semantic embedding space induced by a neural network. In addition, we extend the model to capture query-independent properties, such as frame quality. We compare our method against previous state of the art on textual-visual embeddings for thumbnail selection and show that our model outperforms them on relevance prediction. Furthermore, we introduce a new dataset, annotated with diversity and query-specific relevance labels. On this dataset, we train and test our complete model for video summarization and show that it outperforms standard baselines such as Maximal Marginal Relevance.

MM’17, October 23–27, 2017, Mountain View, CA, USA. DOI: https://doi.org/10.1145/3123266.3123297. ISBN 978-1-4503-4906-2/17/10.

1. Introduction

Video recording devices have become omnipresent. Most of the videos taken with smartphones, surveillance cameras and wearable cameras are recorded with a capture first, filter later mentality. However, most raw videos never end up getting curated and remain too long, shaky, redundant and boring to watch. This raises new challenges in searching both within and across videos.

The problem of making video content more accessible has spurred research in automatic tagging (Qi et al., 2007; Ballan et al., 2015; Mazloom et al., 2016) and video summarization (Sun et al., 2014a; Gygli et al., 2015; Potapov et al., 2014; Lee et al., 2012; Khosla et al., 2013; Lu and Grauman, 2013; Arev et al., 2014; Kim et al., 2014; Zhao and Xing, 2014). In automatic tagging, the goal is to predict meta-data in the form of tags, which makes videos searchable via text queries. Video summarization, on the other hand, aims at making videos more accessible by reducing them to a few interesting and representative frames (Khosla et al., 2013; Lee et al., 2012) or shots (Gygli et al., 2015; Song et al., 2015).

This paper combines the goals of summarising videos and making them searchable with text. Specifically, we propose a novel method that generates video summaries adapted to a text query (see Fig. 1). Our approach improves on previous works in the area of textual-visual embeddings (Kiros et al., 2014; Liu et al., 2015) and proposes an extension of an existing video summarization method using submodular mixtures (Gygli et al., 2015) for creating summaries that are query-adaptive.

Our method for creating query-relevant summaries consists of two parts. We first develop a relevance model which allows us to rank frames of a video according to their relevance given a text query. Relevance is computed as the sum of two terms: the cosine similarity between embeddings of frames and text queries in a learned visual-semantic embedding space, and a query-independent term. While the embedding captures semantic similarity between video frames and text queries, the query-independent term predicts relevance based on the quality, composition and the interestingness of the content itself. We train this model on a large dataset of image search data (Hua et al., 2013) and our newly introduced Relevance and Diversity dataset (Section 5). The second part of the summarization system is a framework for optimising the selected set of frames not only for relevance, but also for representativeness and diversity using a submodular mixture of objectives. Figure 2 shows an overview of our complete pipeline. We publish our code and demos (https://github.com/arunbalajeev/query-video-summary) and make the following contributions:

• Several improvements on learning a textual-visual embedding for thumbnail selection compared to the work by Liu et al. (Liu et al., 2015). These include better alignment of the learning objective to the task at test time and modeling the text queries using LSTMs, yielding significant performance gains.

• A way to model semantic similarity and quality aspects of frames jointly, leading to better performance compared to using the similarity to text queries only.

• We adapt the submodular mixtures model for video summarization by Gygli et al. (Gygli et al., 2015) to create query-adaptive and diverse summaries using our frame-based relevance model.

• A new video thumbnail dataset providing query relevance and diversity labels. As the judgements are subjective, we collect multiple annotations per video and analyse the consistency of the obtained labelling.

2. Related Work

The goal of video summarization is to select a subset of frames that gives a user an idea of the video’s content at a glance (Truong and Venkatesh, 2007). To find informative frames for this task, two dominant approaches exist: (i) modelling generic frame interestingness (Lee et al., 2012; Gygli et al., 2016) or (ii) using additional information such as the video title or a text query to find relevant frames (Liu et al., 2009; Song et al., 2015; Liu et al., 2015). In this work we combine the two into one model and make several contributions for query-adaptive relevance prediction. Such models are related to automatic tagging (Qi et al., 2007; Ballan et al., 2015; Mazloom et al., 2016), textual-visual embeddings (Frome et al., 2013; Socher et al., 2014; Liu et al., 2015) and image description (Das et al., 2013; Barbu et al., 2012; Karpathy and Li, 2015; Donahue et al., 2015; Mao et al., 2014; Chen and Zitnick, 2014; Karpathy et al., 2014; Fang et al., 2015) . In the following we discuss approaches for video summarization, generic interestingness prediction models and previous works for obtaining embeddings.

Video summarization. Video summarization methods can be broadly classified into abstractive and extractive approaches. Abstractive or compositional approaches transform the initial video into a more compact and appealing representation, e.g. hyperlapses (Kopf et al., 2014), montages (Sun et al., 2014b) or video synopses (Pritch et al., 2008). The goal of extractive methods is instead to select an informative subset of keyframes (Wolf, 1996; Lee et al., 2012; Khosla et al., 2013; Kim et al., 2014) or video segments (Gygli et al., 2015; Lu and Grauman, 2013) from the initial video. Our method is extractive. Extractive methods need to optimise at least two properties of the summary: the quality of the selected frames and their diversity (Sharghi et al., 2016; Gygli et al., 2015; Gong et al., 2014). Sometimes, additional objectives such as temporal uniformity (Gygli et al., 2015) and relevance (Sharghi et al., 2016) are also optimised. The simplest approach to obtain a representative and diverse summary is to cluster videos into events and select the best frame per event (de Avila et al., 2011). More sophisticated approaches jointly optimise for importance and diversity by using determinantal point processes (DPPs) (Gong et al., 2014; Sharghi et al., 2016; Zhang et al., 2016) or submodular mixtures (Lin and Bilmes, 2012; Gygli et al., 2015). Most related to our paper is the work of Sharghi et al. (Sharghi et al., 2016), who present an approach for query-adaptive video summarization using DPPs. Their method, however, is limited to a small, fixed set of concepts such as car or flower. The authors leave handling of unconstrained queries, as in our approach, for future work. In this work, we formulate video summarization as a maximisation problem over a set of submodular functions, following (Gygli et al., 2015).

Frame quality/interestingness. Most methods that predict frame interestingness are based on supervised learning. The prediction problem can be formulated as a classification (Potapov et al., 2014), regression (Lee et al., 2012; Zen et al., 2016), or, as is now most common, as a ranking problem (Sun et al., 2014a; Gygli et al., 2016; Yao et al., 2016; Sun et al., 2017). To simplify the task, some approaches assume the domain of the video given and train a model for each domain (Potapov et al., 2014; Sun et al., 2014a; Yao et al., 2016).

An alternative approach based on unsupervised learning, proposed by Xiong et al. (Xiong and Grauman, 2014), detects “snap points” by using a web image prior. Their model considers frames suitable as keyframes if the composition of the frames matches the composition of the web images, regardless of the frame content. Our approach is partially inspired by this work in that it predicts relevance even in the absence of a query, but relies on supervised learning.

Unconstrained Textual-visual models. Several methods exist that can retrieve images given unconstrained text or vice versa (Frome et al., 2013; Mao et al., 2014; Karpathy et al., 2014; Karpathy and Li, 2015; Donahue et al., 2015; Fang et al., 2015; Habibian et al., 2016). These typically project both modalities into a joint embedding space (Frome et al., 2013), where semantic similarity can be compared using a measure like cosine similarity. Word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) are popular choices to obtain the embeddings of text. Deep image features are then mapped to the same space via a learned projection. Once both modalities are in the same space, they may be easily compared (Frome et al., 2013). A multi-modal semantic embedding space is often used by Zero-shot learning approaches (Frome et al., 2013; Norouzi et al., 2013; Jain et al., 2015) to predict test labels which are unseen in the training. Habibian et al. (Habibian et al., 2016), in the same spirit, propose zero-shot recognition of events in videos by learning a video representation that aligns text, audio and video features. Similarly, Liu et al. (Liu et al., 2015) use textual-visual embeddings for video thumbnail selection. Our relevance model is based on Liu et al. (Liu et al., 2015), but we provide several important improvements. (i) Rather than keeping the word representation fixed, we jointly optimise the word and image projection. (ii) Instead of embedding each word separately, we train an LSTM model that combines a complete query into one single embedding vector, thus it even learns multi-word combinations such as visit to lake and Star Wars movie. (iii) In contrast to Liu et al. (Liu et al., 2015), we directly optimise the target objective. Our experiments show that these changes lead to significantly better performance in predicting relevant thumbnails.

3. Method for Relevance Prediction

The goal of this work is to introduce a method to automatically select a set of video thumbnails that are both relevant with respect to a query, but also diverse enough to represent the video. To later optimise relevance and diversity jointly, we first need a way to evaluate the relevance of frames.

Our relevance model learns a projection of video frames $v$ and text queries $t$ into the same embedding space. We denote the projection of $t$ and $v$ as $\mathbf{t}$ and $\mathbf{v}$, respectively. Once trained, the relevance of a frame $v$ given a query $t$ can be estimated via some similarity measure. As in (Frome et al., 2013), we use the cosine similarity

$$s(\mathbf{t}, \mathbf{v}) = \frac{\mathbf{t} \cdot \mathbf{v}}{\|\mathbf{t}\| \, \|\mathbf{v}\|}. \tag{1}$$

While this lets us assess the semantic relevance of a frame w.r.t. a query, it is also possible to make a prediction on the suitability of frames as thumbnails a priori, based on the frame quality, composition, etc. (Xiong and Grauman, 2014). Thus, we propose to extend the above notion of relevance and model the quality aspects of thumbnails explicitly by computing the final relevance as the sum of the embedding similarity and the query-independent frame quality term, i.e.

$$r(t, v) = s(\mathbf{t}, \mathbf{v}) + q_{v}, \tag{2}$$

where $q_{v}$ is a query-independent score determining the suitability of $v$ as a thumbnail, based on the quality of a frame.
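The relevance score described above, cosine similarity plus a query-independent quality term, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the toy 3-dimensional embeddings and the quality values are made up here and are not taken from the released code.

```python
import math


def cosine_similarity(t, v):
    """Cosine similarity s(t, v) between a query embedding and a frame embedding."""
    dot = sum(a * b for a, b in zip(t, v))
    norm_t = math.sqrt(sum(a * a for a in t))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_t * norm_v)


def relevance(t, v, q_v):
    """Relevance r(t, v): embedding similarity plus the frame quality score q_v."""
    return cosine_similarity(t, v) + q_v


# Toy data (hypothetical): each frame has an embedding and a quality score.
query = [1.0, 0.0, 0.0]
frames = {
    "frame_a": ([1.0, 0.0, 0.0], 0.1),  # matches the query, modest quality
    "frame_b": ([0.0, 1.0, 0.0], 0.3),  # orthogonal to the query, higher quality
}

# Rank frames by decreasing relevance to the query.
ranked = sorted(frames, key=lambda name: relevance(query, *frames[name]), reverse=True)
```

In this toy case the semantically matching frame is ranked first even though the other frame has the higher quality score, because the similarity term dominates.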

In the following, we investigate how to formulate the task of obtaining the embeddings $\mathbf{t}$ and $\mathbf{v}$ , as well as $q_{v}$ .

3.1. Training objective

Intuitively, our model should be able to answer “What is the best thumbnail for this query?”. Thus, the problem of picking the best thumbnail for a video is naturally formulated as a ranking problem. We desire that the embedding vectors of a query and frame that are a good match are more similar than those of the same query and a non-relevant frame. (Liu et al. (Liu et al., 2015) do the inverse: they pose the problem as learning to assign a higher similarity to a corresponding frame and query than to the same frame and a random query, so their model learns to answer the question “what is a good query for this image?”.) Thus, our model should learn to satisfy the rank constraint that given a query $t$, the relevance score of the relevant frame $v^{+}$ is higher than the relevance score of the irrelevant frame $v^{-}$:

$$r(t, v^{+}) > r(t, v^{-}). \tag{3}$$

Alternatively, we can train the model by requiring that both the similarity score and the quality score of the relevant frame are higher than for the irrelevant frame explicitly, rather than imposing a constraint only on their sum, as above. In this case we would be imposing the two following constraints:

$$s(\mathbf{t}, \mathbf{v}^{+}) > s(\mathbf{t}, \mathbf{v}^{-}), \qquad q_{v^{+}} > q_{v^{-}}. \tag{4}$$

Experimentally, we find that training with these explicit constraints leads to slightly improved performance (See Tab. 1).

In order to impose these constraints and train the model, we define the loss as

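$$loss(t, v^{+}, v^{-}) = l_{p}\big(\max(0, \gamma - s(\mathbf{t}, \mathbf{v}^{+}) + s(\mathbf{t}, \mathbf{v}^{-}))\big) + l_{p}\big(\max(0, \gamma - q_{v^{+}} + q_{v^{-}})\big), \tag{5}$$

where $l_{p}$ is a cost function and $\gamma$ is a margin parameter. We follow (Gygli et al., 2016) and use a Huber loss for $l_{p}$, i.e. the robust version of an $l_{2}$ loss. Next, we describe how to parametrize $\mathbf{t}$, $\mathbf{v}$ and $q_{v}$, so that they can be learned.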
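A minimal sketch of such a ranking loss for one triplet follows. It assumes the explicit-constraint variant, with one hinge term on the similarity scores and one on the quality scores, and the textbook definition of the Huber loss; the default values `gamma=1.0` and `delta=1.5` are the ones reported later in the implementation details. The function names are illustrative and do not come from the authors' code.

```python
def huber(u, delta=1.5):
    """Standard Huber loss: quadratic near zero, linear for |u| > delta."""
    if abs(u) <= delta:
        return 0.5 * u * u
    return delta * (abs(u) - 0.5 * delta)


def ranking_loss(s_pos, s_neg, q_pos, q_neg, gamma=1.0, delta=1.5):
    """Huber-smoothed hinge loss for one (query, relevant frame, irrelevant frame) triplet.

    s_pos / s_neg: similarity of the query to the relevant / irrelevant frame.
    q_pos / q_neg: quality score of the relevant / irrelevant frame.
    """
    sim_violation = max(0.0, gamma - s_pos + s_neg)
    quality_violation = max(0.0, gamma - q_pos + q_neg)
    return huber(sim_violation, delta) + huber(quality_violation, delta)
```

When both margins are satisfied the loss is zero; a large violation is penalised only linearly, which is what makes the Huber variant less sensitive to outlier triplets than a squared hinge.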

3.2. Text and Frame Representation

We use a convolutional neural network (CNN) for predicting $\mathbf{v}$ and $q_{v}$ , while $\mathbf{t}$ is obtained via a recurrent neural network. To jointly learn the parameters of these networks, we use a Siamese ranking network, trained with triplets of $(t,v^{+},v^{-})$ where the weights for the subnets predicting $v^{+}$ and $v^{-}$ are shared. We provide the model architecture in supplementary material. We now describe the textual representation $\mathbf{t}$ and the image representations $\mathbf{v}$ and $q_{v}$ in more detail.

Textual representation. As a feature representation $\mathbf{t}$ of the textual query $t$ , we first project each word of the query into a $300$ -dimensional semantic space using the word2vec model (Mikolov and Dean, 2013), which is trained on GoogleNews dataset. We fine-tune the word2vec model using the unique queries from the Bing Clickture dataset (Hua et al., 2013) as sentences. Then, we encode the individual word representations into a single fixed-length embedding using an LSTM (Hochreiter and Schmidhuber, 1997). We use a many-to-one prediction, where the model outputs a fixed length output at the final time-step. This allows us to emphasize visually informative words and handle phrases.

Image representation. To represent the image, we leverage the feature representations of a VGG-19 network (Simonyan and Zisserman, 2014) pre-trained on ImageNet (Deng et al., 2009). We replace the softmax layer (1000 nodes) of the VGG-19 network with a linear layer $M$ with 301 dimensions. The first 300 dimensions are used as the embedding $\mathbf{v}$, while the last dimension represents the quality score $q_{v}$.
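The split of the 301-dimensional output into embedding and quality score can be illustrated as follows. This is a sketch on a plain Python list rather than a network tensor, and `split_projection` is a hypothetical helper name.

```python
def split_projection(output, embed_dim=300):
    """Split the output of the linear layer into (embedding v, quality score q_v)."""
    if len(output) != embed_dim + 1:
        raise ValueError("expected %d values, got %d" % (embed_dim + 1, len(output)))
    return list(output[:embed_dim]), output[embed_dim]


# Toy output vector: 300 embedding values followed by one quality value.
toy_output = [0.0] * 300 + [0.7]
embedding, quality = split_projection(toy_output)
```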

4. Summarization model

We use the framework of submodular optimization to create summaries that take into account multiple objectives (Lin and Bilmes, 2012). In this framework, summarization is posed as the problem of selecting a subset (in our case, of frames) $\mathbf{y^{*}}$ that maximizes a linear combination of submodular objective functions $\mathbf{f}(\mathbf{x_{\mathcal{V}},y})=[f_{1}(\mathbf{x_{\mathcal{V}},y}),...,f_{n}(\mathbf{x_{\mathcal{V}},y})]^{T}$ . Specifically,

$$\mathbf{y}^{*} = \arg\max_{\mathbf{y} \in \mathcal{Y_{V}}} \mathbf{w}^{T} \mathbf{f}(\mathbf{x_{\mathcal{V}}}, \mathbf{y}), \tag{6}$$

where $\mathcal{Y_{V}}$ denotes the set of all possible solutions $\mathbf{y}$ and $\mathbf{x_{\mathcal{V}}}$ the features of video $\mathcal{V}$. In this work, we assume that the cardinality $|\mathbf{y}|$ is fixed to some value $k$ (we use $k=5$ in our experiments).

For non-negative weights $\mathbf{w}$ , the objective in Eq. (6) is submodular (Krause and Golovin, 2012), meaning that it can be optimized near-optimally in an efficient way using a greedy algorithm with lazy evaluations (Nemhauser et al., 1978; Minoux, 1978).
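The greedy procedure can be sketched as below. This is the plain greedy algorithm without the lazy-evaluation speed-up, and the two toy objectives (a sum of per-frame relevance scores and an order-dependent minimum-distance diversity term on a 1-D feature) are simplified stand-ins chosen for illustration, not the paper's features.

```python
def greedy_select(candidates, objectives, weights, k):
    """Greedily pick k items maximising a weighted sum of set objectives."""
    selected = []
    remaining = list(candidates)

    def score(subset):
        return sum(w * f(subset) for w, f in zip(weights, objectives))

    for _ in range(min(k, len(remaining))):
        current = score(selected)
        # Pick the candidate with the largest marginal gain.
        best = max(remaining, key=lambda c: score(selected + [c]) - current)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (hypothetical): per-frame relevance and a 1-D visual feature.
toy_relevance = {0: 0.9, 1: 0.8, 2: 0.1}
toy_feature = {0: 0.0, 1: 0.05, 2: 1.0}


def relevance_objective(subset):
    return sum(toy_relevance[i] for i in subset)


def diversity_objective(subset):
    # Each item contributes its distance to the closest previously selected item.
    total = 0.0
    for pos, i in enumerate(subset):
        if pos > 0:
            total += min(abs(toy_feature[i] - toy_feature[j]) for j in subset[:pos])
    return total


relevance_only = greedy_select([0, 1, 2], [relevance_objective, diversity_objective], [1.0, 0.0], 2)
with_diversity = greedy_select([0, 1, 2], [relevance_objective, diversity_objective], [1.0, 1.0], 2)
```

With relevance alone the two near-duplicate frames 0 and 1 are chosen; giving weight to diversity swaps frame 1 for the visually distinct frame 2, which is the trade-off the mixture weights control.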

Objective functions. We choose a small set of objective functions, each capturing different aspects of the summary.

(1) Query similarity $\mathbf{f}(\cdot,\cdot)=\sum_{v\in\mathbf{y}}s(\mathbf{t},\mathbf{v})$, where $\mathbf{t}$ is the query embedding, $\mathbf{v}$ is the frame embedding and $s(\cdot,\cdot)$ denotes the cosine similarity defined in Eq. (1).

(2) Quality score $\mathbf{f}(\cdot,\cdot)=\sum_{v\in\mathbf{y}}q_{v}$, where $q_{v}$ represents a score that is based on the quality of $v$ as a thumbnail. This model scores the image relevance in a query-independent manner based on properties such as contrast, composition, etc.

(3) Diversity of the elements in the summary $\mathbf{f}(\mathbf{x_{\mathcal{V}},y})=\sum_{i\in\mathbf{y}}\min\limits_{j<i}D_{x_{\mathcal{V}}}(i,j)$, according to some dissimilarity measure $D$. We use the Euclidean distance of the FC2 features of the VGG-19 network for $D$. (A derivation of the submodularity of this objective is provided in the supplementary material.)

(4) Representativeness (Gygli et al., 2015). This objective favors selecting the medoid frames of a video, such that the visually frequent frames in the video are represented in the summary.

Weight learning. To learn the weights $\mathbf{w}$ in Eq. (6), ground truth summaries for query-video pairs are required. Previous methods typically only optimized for relevance (Liu et al., 2015) or used small datasets with limited vocabularies (Sharghi et al., 2016). Thus, to be able to train our model, we collected a new dataset with relevance and diversity annotations, which we introduce in the next Section.

If relevance and diversity labels are known, we can estimate the optimal mixing weights of the submodular functions through subgradient descent (Lin and Bilmes, 2012). In order to directly optimize for the F1-score used at test time, we use a locally modular approximation based on the procedure of (Narasimhan and Bilmes, 2012) and optimize the weights using AdaGrad (Duchi et al., 2011).

5. Relevance And Diversity Dataset (RAD)

We collected a dataset with query relevance and diversity annotation to let us train and evaluate query-relevant summaries. Our dataset consists of $200$ videos, each of which was retrieved given a different query.

Using Amazon Mechanical Turk (AMT) we first annotate the video frames with query relevance labels, and then partition the frames into clusters according to visual similarity. These kinds of labels were used previously in the MediaEval diverse social images challenge (Ionescu et al., 2015) and enabled evaluation of automatic methods for creating relevant and diverse summaries.

To select a representative sample of queries and videos for the dataset, we used the following procedure: We take the top YouTube queries between $2008$ and $2016$ from $22$ different categories as seed queries (https://www.google.com/trends/explore). These queries are typically rather short and generic concepts, so to obtain longer, more realistic queries we use YouTube auto-complete to suggest phrases. Using this approach we collect $200$ queries. Some examples are brock lesnar vs big show, taylor swift out of the woods, etc. For each query, we take the top video result with a duration of $2$ to $3$ minutes.

To annotate the videos, we set up two consecutive tasks on AMT. All videos are sampled at one frame per second. In the first task, a worker is asked to label each frame with its relevance w.r.t. the given query. Options for answers are “Very Good”, “Good”, “Not good” and “Trash”, where trash indicates that the frame is both irrelevant and low-quality (e.g. blurred, bad contrast, etc.). After annotating the relevance, the worker is asked to distribute the frames into clusters according to their visual similarity. We obtain one clustering per worker, where each clustering consists of mutually exclusive subsets of video frames as clusters. The number of clusters in the clustering is chosen by the worker. Each video is annotated by $5$ different people and a total of $48$ subjects participated in the annotation. To ensure high-quality annotations, we defined a qualification task, where we check the results manually to ensure the workers provide good annotations. Only workers who pass this test are allowed to take further assignments.

5.1. Analysis

We now analyse the two kinds of annotations obtained through this procedure and describe how we merge these annotations into one set of ground truth labels per video.

Label distributions. The distribution of relevance labels is “Very Good”: $17.55\%$ , “Good”: $57.40\%$ , “Not good”: $12.31\%$ and “Trash”: $12.72\%$ . The minimum, maximum and mean number of clusters per video are $4.9$ , $25.2$ and $13.4$ respectively over all videos of RAD.

Relevance annotation consistency. Given the inherent subjectivity of the task, we want to know whether annotators agree with each other about the query relevance of frames. To do this, we follow previous work (Isola et al., 2011; Gygli et al., 2013; Wang et al., 2016) and compute the Spearman rank correlation ($\rho$) between the relevance scores of different subjects, splitting the five annotations of each video into two groups of two and three raters each. We take all split combinations to find the mean $\rho$ for a video.

Our dataset has an average correlation of $\rho=0.73$ over all videos, where $1$ is a perfect correlation while $0$ would indicate no consistency in the scores. On the related task of event-specific image importance, using five annotators, consistency is only $\rho=0.4$ (Wang et al., 2016). Thus, we can be confident that our relevance labels are of high quality.
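The split-half consistency measure can be sketched as follows. Two details are assumptions made for this illustration and are not stated in the text: the scores within each group of raters are combined by averaging, and tied scores receive average ranks. The function names are hypothetical.

```python
import math
from itertools import combinations


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    ra, rb = average_ranks(a), average_ranks(b)
    mean_a, mean_b = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / math.sqrt(var_a * var_b)


def split_half_consistency(scores):
    """Mean Spearman correlation over all 2-vs-rest splits of the raters.

    scores[r][f] is the relevance score rater r gave to frame f.
    """
    n_raters, n_frames = len(scores), len(scores[0])
    rhos = []
    for group in combinations(range(n_raters), 2):
        rest = [r for r in range(n_raters) if r not in group]
        mean_a = [sum(scores[r][f] for r in group) / len(group) for f in range(n_frames)]
        mean_b = [sum(scores[r][f] for r in rest) / len(rest) for f in range(n_frames)]
        rhos.append(spearman(mean_a, mean_b))
    return sum(rhos) / len(rhos)
```

For five raters this evaluates the ten possible two-versus-three splits; the sketch does not handle the degenerate case where one group gives every frame the same score.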

Cluster consistency. To the best of our knowledge, we are the first to annotate multiple clusterings per video and look into the consistency of multiple annotators. MediaEval, for example, used multiple relevance labels but only one clustering (Ionescu et al., 2015). Various ways of measuring the consistency of clusterings exist, e.g. Variation of Information, Normalised Mutual Information or the Rand index (see Wagner and Wagner (Wagner and Wagner, 2007) for an excellent overview). In the following we propose to use Normalised Mutual Information (NMI), an information theoretic measure (Fred and Jain, 2003) which is twice the ratio of the mutual information between two clusterings ($I(C,C^{\prime})$) to the sum of entropies of the clusterings ($H(C)+H(C^{\prime})$):

$$NMI(C, C^{\prime}) = \frac{2 \cdot I(C, C^{\prime})}{H(C) + H(C^{\prime})}. \tag{7}$$

We chose NMI over the more recently proposed Variation of Information (VI) (Meilă, 2003), as NMI has a fixed range ( $\left[0,1\right]$ ) while still being closely related to VI (see supplementary material).

Our dataset has a cluster consistency of $0.54$. Since NMI is $0$ if two clusterings are independent and $1$ iff they are identical, we see that our annotators have a high degree of agreement.
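For concreteness, NMI between two clusterings of the same frames can be computed as in the sketch below, where each clustering is given as a list of cluster ids, one per frame. The representation and the function names are choices made for this illustration; the convention of returning 1 when both clusterings consist of a single cluster is also an assumption.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H(C) of a clustering given as one cluster id per frame."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information I(C, C') between two clusterings of the same frames."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    return sum(
        (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def nmi(a, b):
    """Normalised mutual information: 2 * I(C, C') / (H(C) + H(C'))."""
    denom = entropy(a) + entropy(b)
    if denom == 0.0:
        return 1.0  # both clusterings put every frame in one cluster
    return 2.0 * mutual_information(a, b) / denom
```

Two clusterings that differ only in how the clusters are named score 1, while two clusterings that cut the frames along unrelated lines score 0.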

Ground truth. For evaluation on the test videos, we create a single ground truth annotation for each video. We merge the five relevance annotations as well as the clusterings of each query-video pair. For the final ground truth of relevance prediction, we require the labels to be either positive or negative for each video frame. We map all “Very Good” labels to $1$, “Good” labels to $0.5$ and “Not Good” and “Trash” labels to $0$. We compute the mean of the five relevance annotation labels and label the frame as positive if the mean is $\geq 0.5$ and as negative otherwise.
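The merging rule for relevance labels can be written out as a short sketch. It assumes that “Not Good” and “Trash” both map to 0; the dictionary and function names are illustrative.

```python
# Numeric value of each annotation label (the 0 for the two lowest labels is an assumption).
LABEL_VALUE = {"Very Good": 1.0, "Good": 0.5, "Not Good": 0.0, "Trash": 0.0}


def merge_relevance(annotations):
    """Return True (positive frame) if the mean label value is at least 0.5."""
    mean = sum(LABEL_VALUE[label] for label in annotations) / len(annotations)
    return mean >= 0.5


# Five annotations of one frame: mean = (1 + 0.5 + 0.5 + 0 + 0) / 5 = 0.4, so negative.
example = merge_relevance(["Very Good", "Good", "Good", "Trash", "Trash"])
```

Under this rule a frame that all five annotators rate “Good” is positive, whereas a frame with two low ratings needs correspondingly stronger ratings from the others to stay positive.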

To merge clustering annotations, we calculate NMI between all pairs of clusterings and choose the clustering with the highest mean NMI, i.e. the most prototypical clustering. An example of relevance and clustering annotation is provided in Fig. 6.

6. Configuration testing

Before comparing our proposed relevance model against the state of the art in Sec. 7, we first analyze our model performance using different objectives, cost functions and text representations. For evaluation, we use the Query-dependent Thumbnail Selection Dataset (QTS) provided by (Liu et al., 2015). The dataset contains $20$ candidate thumbnails for each video, each of which is labeled with one of five ratings: Very Good (VG), Good (G), Fair (F), Bad (B), or Very Bad (VB). We evaluate on the available $749$ query-video pairs. To transform the categorical labels to numerical values, we use the same mapping as (Liu et al., 2015).

Evaluation metrics. As evaluation metrics, we are using HIT@1 and mean Average Precision (mAP) as reported and defined in Liu et al. (Liu et al., 2015), as well as the Spearman’s Rank Correlation. HIT@1 is computed as the hit ratio for the highest ranked thumbnail.
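As an illustration of HIT@1, the sketch below counts a hit whenever the top-scored candidate of a query-video pair carries a “VG” or “G” label, which is the variant reported in Table 1. The data layout (one list of scores and one list of labels per pair) and the function name are assumptions of this example, and ties in the scores are broken in favour of the first candidate.

```python
def hit_at_1(predictions, labels, positive=("VG", "G")):
    """Fraction of query-video pairs whose top-scored candidate has a positive label.

    predictions[i][j]: model score of candidate j for pair i.
    labels[i][j]: ground-truth rating of that candidate.
    """
    hits = 0
    for scores, ratings in zip(predictions, labels):
        top = max(range(len(scores)), key=scores.__getitem__)
        if ratings[top] in positive:
            hits += 1
    return hits / len(predictions)


# Toy example with two pairs: the first top pick is "VG", the second is "VB".
toy_scores = [[0.9, 0.1], [0.2, 0.8]]
toy_labels = [["VG", "B"], ["G", "VB"]]
toy_hit = hit_at_1(toy_scores, toy_labels)
```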

Training dataset. For training, we use two datasets: (i) the Bing Clickture dataset (Hua et al., 2013) and (ii) the RAD dataset (Sec. 5). Clickture is a large dataset consisting of queries and retrieved images from Bing Image search. The annotation is in the form of triplets $(K,Q,C)$, meaning that the image $K$ was clicked $C$ times in the search results of the query $Q$. This dataset is well suited for training our relevance model, since our task is the retrieval of relevant keyframes from a video, given a text query. It is, however, from the image and not the video domain. Thus, we additionally fine-tune the models on the complete RAD dataset consisting of $200$ query-video pairs. From each query-video pair, we sample an equal number of positive and negative frames to give equal weight to each video. In total, we use $0.5M$ triplets (as in Sec. 3.2) from Clickture and $14K$ triplets from RAD for training.

Implementation details. We preprocess the images as in (Simonyan and Zisserman, 2014). We truncate the number of words in the query at $14$, as a tradeoff between the mean and maximum query length in the Clickture dataset ($5$ and $26$ respectively) (Mueller and Thyagarajan, 2016). We set the margin parameter $\gamma$ in the loss in Eq. (5) to $1$ and the tradeoff parameter $\delta$ for the Huber loss to $1.5$ as in (Gygli et al., 2016). The LSTM consists of a hidden layer with $512$ units. We train the parameters of the LSTM and projection layer $M$ using stochastic gradient descent with adaptive weight updates (AdaGrad) (Duchi et al., 2011). We add an $l_{2}$ penalty on the weights, with a $\lambda$ of $10^{-3}$. We train for $20$ epochs using minibatches of $128$ triplets.
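For orientation, the hyperparameters above can be collected in one place, together with a Huber-style ranking loss. This is a sketch under stated assumptions: `TrainConfig`, `truncate_query` and `huber_rank_loss` are hypothetical names, whitespace tokenization is a simplification, and the piecewise form of the loss (quadratic up to $\delta$, linear beyond) is the standard Huber shape applied to a margin violation, which is assumed here rather than copied from Eq. (5).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    max_query_words: int = 14     # query truncation length
    margin: float = 1.0           # gamma in the ranking loss
    huber_delta: float = 1.5      # delta, tradeoff parameter of the Huber loss
    lstm_hidden_units: int = 512
    l2_weight: float = 1e-3       # lambda of the l2 penalty
    epochs: int = 20
    batch_size: int = 128         # triplets per minibatch

def truncate_query(query, cfg):
    """Keep at most cfg.max_query_words whitespace-separated words."""
    return query.split()[: cfg.max_query_words]

def huber_rank_loss(score_pos, score_neg, cfg):
    """Assumed Huber ranking loss on the margin violation u.

    u = max(0, gamma - s_pos + s_neg); the loss is quadratic for
    u <= delta and linear above, so large violations are penalized
    less aggressively than with a purely squared loss.
    """
    u = max(0.0, cfg.margin - score_pos + score_neg)
    if u <= cfg.huber_delta:
        return 0.5 * u * u
    return cfg.huber_delta * u - 0.5 * cfg.huber_delta ** 2

cfg = TrainConfig()
```

The two branches meet at $u = \delta$, so the loss is continuous there.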

6.1. Tested components

We discuss three important components of our model next.

Objective. We compare our proposed training objective to that of Liu et al. (Liu et al., 2015). Their model is trained to rank a positive query higher than a negative query given a fixed frame. In contrast, our method is trained to rank a positive frame higher than a negative frame given a fixed query.
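The difference between the two objectives lies in which element of the triplet is held fixed. The sketch below illustrates this with a plain hinge loss and a dot-product score. Both are stand-ins chosen for brevity, and the function names are hypothetical; neither is taken from the paper's equations.

```python
def dot(u, v):
    """Stand-in relevance score between two embedding vectors."""
    return sum(a * b for a, b in zip(u, v))

def hinge(score_pos, score_neg, margin=1.0):
    """Zero once the positive outscores the negative by the margin."""
    return max(0.0, margin - score_pos + score_neg)

def frame_ranking_loss(query, pos_frame, neg_frame):
    """Fixed query: a relevant frame should outscore an irrelevant frame."""
    return hinge(dot(query, pos_frame), dot(query, neg_frame))

def query_ranking_loss(frame, pos_query, neg_query):
    """Fixed frame: the matching query should outscore a mismatched query
    (the direction described for Liu et al.)."""
    return hinge(dot(frame, pos_query), dot(frame, neg_query))

# Toy 2-D embeddings.
query = [1.0, 0.0]
good_frame = [2.0, 0.0]
bad_frame = [0.0, 1.0]
loss_ok = frame_ranking_loss(query, good_frame, bad_frame)
loss_violated = frame_ranking_loss(query, bad_frame, good_frame)
```

Only the fixed-query form compares frames with each other, which is what lets the model learn to tell good frames from bad ones.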

Cost function. We also investigate the importance of modeling frame quality. In particular, we compare different cost functions: (i) we enforce two ranking constraints, one for the quality term and one for the embedding similarity, as in Eq. (4) ($Q_{expli}$); (ii) we sum the quality and similarity terms into one output score, for which we enforce the rank constraint, as in Eq. (3) ($Q_{impli}$); or (iii) we do not model quality at all.
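The three variants can be contrasted with a small sketch. It uses a plain hinge ranking term for readability, and it assumes that the two constraints of the explicit variant are simply added. Eq. (3) and Eq. (4) are not reproduced here, so the function names and the exact combination are illustrative only.

```python
def rank_loss(score_pos, score_neg, margin=1.0):
    """Hinge ranking term: zero once the positive wins by the margin."""
    return max(0.0, margin - score_pos + score_neg)

def loss_no_quality(sim_pos, sim_neg):
    """(iii) Rank on embedding similarity only."""
    return rank_loss(sim_pos, sim_neg)

def loss_q_impli(sim_pos, q_pos, sim_neg, q_neg):
    """(ii) One constraint on the summed similarity + quality score."""
    return rank_loss(sim_pos + q_pos, sim_neg + q_neg)

def loss_q_expli(sim_pos, q_pos, sim_neg, q_neg):
    """(i) Separate constraints for similarity and for quality
    (assumed here to be added together)."""
    return rank_loss(sim_pos, sim_neg) + rank_loss(q_pos, q_neg)

# Toy case: similarity ranks the pair correctly, quality does not.
example = dict(sim_pos=2.0, q_pos=0.0, sim_neg=0.0, q_neg=1.5)
l_none = loss_no_quality(example["sim_pos"], example["sim_neg"])
l_impli = loss_q_impli(**example)
l_expli = loss_q_expli(**example)
```

In the toy case the implicit variant lets a strong similarity score partly compensate for a wrongly ordered quality score, whereas the explicit variant penalizes the quality ordering on its own.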

Text representation. As mentioned in Sec. 3.2, we represent the words of the query using word vectors. To combine the individual word representations into a single vector, we investigate two approaches: (i) averaging the word embedding vectors and (ii) using an LSTM model that learns to combine the individual word embeddings.
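The averaging baseline (i) can be sketched as below. The toy `word_vectors` table, the lowercasing, and the choice to skip out-of-vocabulary words are assumptions for illustration, and `embed_query_mean` is a hypothetical name. The LSTM variant (ii) is not shown.

```python
def embed_query_mean(query, word_vectors, max_words=14):
    """Average the word vectors of the first max_words words of the query."""
    tokens = query.lower().split()[:max_words]
    # Assumption: words without a vector are skipped.
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:
        raise ValueError("query contains no known words")
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

# Hypothetical 2-D word vectors.
word_vectors = {
    "dog": [1.0, 0.0],
    "2014": [0.0, 1.0],
}
query_vec = embed_query_mean("Dog 2014", word_vectors)
```

Every word contributes equally to the average, so a visually uninformative token such as a year shifts the query vector as much as a content word does.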

6.2. Results

We show the results of our detailed experiments in Tab. 1. They give insights on several important points.

Text representation. Modeling queries with an LSTM, rather than averaging the individual word representations, improves performance significantly. This is not surprising, as this model can learn to ignore words that are not visually informative (*e.g.* 2014).

Objective and Cost function. The analysis shows that training with our objective leads to better performance compared to using the objective of Liu et al. (Liu et al., 2015). This can be explained with the properties of videos, which typically contain many frames that are low-quality or not visually informative (Song et al., 2016). Thus, formulating the thumbnail task in a way that the model can learn about these quality aspects is beneficial. Using the appropriate triplets for training boosts performance substantially (correlation with the loss of Liu et al. (Liu et al., 2015) + LSTM: $0.270$; ours, Huber + LSTM: $0.367$). When including a quality term in the model, performance improves further, where an explicit loss performs slightly better (Ours: Huber + LSTM + $Q_{expli}$ in Tab. 1).

Somewhat surprisingly, modeling quality alone already outperforms Liu et al. (Liu et al., 2015) in terms of mAP, despite not using any textual information. Quality adds a significant boost to performance in the video domain. Interestingly, this is different in the image domain, due to the difference in quality statistics. Images returned by a search engine are mostly of good quality, thus explicitly accounting for it does not improve performance (see supplementary material).

To conclude, we see that the better alignment of the objective to the keyframe retrieval task, the addition of an LSTM and modeling the quality of the thumbnails improve performance. Together, they provide a substantial improvement compared to the model of Liu et al. Our method achieves an absolute improvement of $6.01$% in HIT@1, $4.04$% in mAP, and an improvement in correlation from $0.186$ to $0.376$. These gains are even more significant when we consider the possible ranges of these metrics. For Spearman correlation, *e.g.*, human agreement is at $0.73$ on the RAD dataset (cf. Sec. 5.1), thus providing an upper bound. Similarly, HIT@1 and mAP have small effective ranges given their high scores for a random model.

7. Experiments

In the previous section, we have determined that our objective, embedding queries with an LSTM and explicitly modelling quality perform best. We call this model QAR (Quality-Aware Relevance) in the following and compare against state-of-the-art (s-o-a) models on the QTS and RAD datasets. We also evaluate the full summarization model on RAD. For these experiments, we split RAD into $100$ videos for training, $50$ for validation and $50$ for testing.

Evaluation metrics. For relevance we use the same metrics as in Sec. 6. To evaluate video summaries on RAD, we additionally use F1 scores. The F1 score is the harmonic mean of the precision of relevance prediction and cluster recall (Ionescu et al., 2015). It is high if a method selects relevant frames from diverse clusters.
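A minimal sketch of this F1 score follows. The function name and input layout are hypothetical. Counting every annotated cluster that contains at least one selected frame as covered is also an assumption; the exact protocol follows Ionescu et al. and is not reproduced here.

```python
def summary_f1(selected, relevant, cluster_of):
    """Harmonic mean of relevance precision and cluster recall.

    selected: frame ids chosen for the summary.
    relevant: set of frame ids annotated as relevant to the query.
    cluster_of: dict mapping every annotated frame id to its cluster id.
    """
    precision = sum(1 for f in selected if f in relevant) / len(selected)
    covered = {cluster_of[f] for f in selected}
    cluster_recall = len(covered) / len(set(cluster_of.values()))
    if precision + cluster_recall == 0:
        return 0.0
    return 2 * precision * cluster_recall / (precision + cluster_recall)

# Toy example: four frames in three clusters, three frames selected.
cluster_of = {0: "a", 1: "a", 2: "b", 3: "c"}
score = summary_f1(selected=[0, 1, 2], relevant={0, 1}, cluster_of=cluster_of)
```

Here two of the three selected frames are relevant and two of the three clusters are covered, so precision, cluster recall and F1 all equal 2/3.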

7.1. Evaluating the Relevance Model

We evaluate our model (QAR) and compare it to Liu et al. (Liu et al., 2015) and Video2GIF (Gygli et al., 2016).

Query-dependent Thumbnail Selection Dataset (QTS) (Liu et al., 2015)

We compare against the s-o-a on the QTS evaluation dataset in Tab. 2. We report the performance of Liu et al. (Liu et al., 2015) from their paper. Note, however, that the results are not directly comparable, as they use query-video pairs for predicting relevance, while only the titles are shared publicly. Thus, we use the titles instead, which is an important difference. Relevance is annotated with respect to the queries, which often differ from the video titles. We compare the re-implementation of (Liu et al., 2015) using titles in detail in Tab. 1.

Encouragingly, our model performs well even when just using the titles and outperforms them on most metrics. It improves mAP by $4.22$% over (Liu et al., 2015) and correlation by a margin of $0.254$ (cf. Table 2). Figure 3 shows the precision-recall curve for the experiment. As can be seen, QAR outperforms (Liu et al., 2015) for all recall ratios. To better understand the effects of using titles or queries, we quantify the value of the two on the RAD dataset below.

Our dataset (RAD). We also evaluate our model on the RAD test set (Tab. 3). QAR (ours) significantly outperforms the previous s-o-a of (Liu et al., 2015; Gygli et al., 2016), even when augmenting Liu et al. (Liu et al., 2015) with an LSTM. QAR improves mAP by $2.9$% when using Titles and $3.9$% when using Queries over our implementation of Liu et al. (Liu et al., 2015)+LSTM.

We also see that modeling quality leads to significant gains in terms of mAP when using Titles or Queries ($+1.7\%$ in both cases). HIT@1 for query relevance, however, is lower when including quality. We believe that the reason for this is that when the query is given, the textual-visual similarity is a more reliable signal to determine the single best keyframe. While including quality improves the overall ranking on mAP, it is solely based on appearance and thus seems to inhibit the fine-grained ranking results at low recall (Fig. 4). However, when only the title is used, the frame quality becomes a stronger predictor for thumbnail selection and improves performance on all metrics. We present some qualitative results of different methods for relevance prediction in Fig. 5.

7.2. Evaluating the Summarization Model

As mentioned in Sec. 4, we use four objectives for our summarization model. Referring to Tab. 4, we use the QAR model to get Similarity and Quality scores, while Diversity and Representativeness scores are obtained as described in Sec. 4. We compare the performance of our full model with each individual objective, a baseline based on Maximal Marginal Relevance (MMR) (Carbonell and Goldstein, 1998) and Hecate (Song et al., 2016). MMR greedily builds a set that maximises the weighted sum of two terms: (i) the similarity of the selected elements to a query and (ii) the dissimilarity to previously selected elements. To estimate the similarity to the query we use our own model (QAR without $Q_{expli}$) and for dissimilarity the diversity as defined in Sec. 4. Finally, we compare it to Hecate, recently introduced in (Song et al., 2016). Hecate estimates frame quality using the stillness of the frame and selects representative and diverse thumbnails by clustering the video with k-means and selecting the highest quality frame from the k largest clusters.
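The greedy MMR baseline can be sketched as follows. This is the textbook formulation, in which redundancy is the maximum similarity to any frame already selected. The `trade_off` parameter, the precomputed `similarity` matrix and the toy numbers are assumptions for illustration; the paper instead plugs in QAR scores for relevance and its own diversity term from Sec. 4 for dissimilarity.

```python
def mmr_select(relevance, similarity, k, trade_off=0.5):
    """Greedy Maximal Marginal Relevance selection of k frame indices.

    relevance[i]: relevance of frame i to the query.
    similarity[i][j]: visual similarity between frames i and j.
    trade_off: weight on relevance; (1 - trade_off) weights novelty.
    """
    selected = []
    candidates = set(range(len(relevance)))
    while candidates and len(selected) < k:
        def gain(i):
            # Redundancy: highest similarity to any frame picked so far.
            redundancy = max((similarity[i][j] for j in selected), default=0.0)
            return trade_off * relevance[i] - (1.0 - trade_off) * redundancy
        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(candidates), key=gain)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy example: frames 0 and 1 are near-duplicates, frame 2 is different.
relevance = [0.9, 0.85, 0.3]
similarity = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
balanced = mmr_select(relevance, similarity, k=2, trade_off=0.5)
relevance_only = mmr_select(relevance, similarity, k=2, trade_off=1.0)
```

With a balanced trade-off the second pick is the less relevant but novel frame 2, while a pure relevance setting picks the near-duplicate frame 1.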

Results. Quantitative results are shown in Tab. 4, while Fig. 6 shows qualitative results. As can be seen, combining all objectives with our model works best. It outperforms all single objectives, as well as the MMR (Carbonell and Goldstein, 1998) baseline, even though MMR also uses our well-performing similarity estimation. Similarity alone has the highest precision, but tends to pick frames that are visually similar (cf. Fig. 6), thus resulting in low cluster recall. Diversification objectives (diversity and representativeness) have a high cluster recall, but the frames are less relevant. Somewhat surprisingly, Hecate (Song et al., 2016) is a relatively strong baseline. In particular, it performs well in terms of relevance, despite using a simple quality score. This further highlights the importance of quality for the thumbnail selection task. It also indicates that the VGG-19 architecture we use might be suboptimal for predicting quality. CNNs for classification use small input resolutions, thus making it difficult to predict quality aspects such as blur. Finding better architectures for that task is actively researched, *e.g.* (Lu et al., 2015; Mai et al., 2016), and might be used to improve our method.

When analysing the learned weights (cf. Tab. 4) we find that the similarity prediction is the most important objective, which matches our expectations. Quality gets a lower, but non-zero weight, thus showing that it provides information that is complementary to query-similarity and helps predict the relevance of a frame. The reader should however be aware that differences in the variance of the objectives can affect the weights learned. Thus, they should be taken with a grain of salt and only be considered tendencies.
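The caveat about scale can be made concrete. Assuming the summary score is a weighted sum of the objectives (the symbols $w_i$, $f_i$ and $c_i$ are introduced here for illustration and are not taken from Sec. 4), rescaling an objective by a constant $c_i > 0$ is exactly compensated by dividing its weight by $c_i$:

$$\sum_{i} w_i\, f_i(S) \;=\; \sum_{i} \frac{w_i}{c_i}\,\bigl(c_i\, f_i(S)\bigr), \qquad c_i > 0 .$$

An objective with a small numerical range can therefore receive a large weight without being more important, which is one reason the learned weights indicate tendencies rather than exact importances.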

8. Conclusion

We introduced a new method for query-adaptive video summarization. At its core lies a textual-visual embedding, which lets us select frames relevant to a query. In contrast to earlier works, such as (Zeng et al., 2016; Sharghi et al., 2016), this model allows us to handle unconstrained queries and even full sentences. We proposed and empirically evaluated different improvements over (Liu et al., 2015) for learning a relevance model. Our empirical evaluation showed that a better training objective, a more sophisticated text model, and explicitly modelling quality lead to significant performance gains. In particular, we showed that quality plays an important role in the absence of high-quality relevance information, such as queries, *i.e.* when only the title can be used. Finally, we introduced a new dataset for thumbnail selection which comes with query-relevance labels and a grouping of the frames according to visual and semantic similarity. On this data, we tested our full summarization framework and showed that it compares favourably to strong baselines such as MMR (Carbonell and Goldstein, 1998) and Hecate (Song et al., 2016). We hope that our new dataset will spur further research in query-adaptive video summarization.

9. Acknowledgements

This work has been supported by Toyota via the project TRACE-Zurich. We also acknowledge the support by the CHIST-ERA project MUSTER. MG was supported by the European Research Council under the project VarCity (#273940).

Bibliography

Arev et al. (2014) I Arev, HS Park, and Yaser Sheikh. 2014. Automatic editing of footage from multiple social cameras. ACM Transactions on Graphics (TOG) (2014).
Ballan et al. (2015) Lamberto Ballan, Marco Bertini, Giuseppe Serra, and Alberto Del Bimbo. 2015. A data-driven approach for tag refinement and localization in web videos. Computer Vision and Image Understanding (2015).
Barbu et al. (2012) Andrei Barbu, Alexander Bridge, Zachary Burchill, Dan Coroian, Sven Dickinson, Sanja Fidler, Aaron Michaux, Sam Mussman, Siddharth Narayanaswamy, Dhaval Salvi, Lara Schmidt, Jiangnan Shangguan, Jeffrey Mark Siskind, Jarrell Waggoner, Song Wang, Jinlian Wei, Yifan Yin, and Zhiqi Zhang. 2012. Video In Sentences Out. UAI (2012). arXiv:1204.2742v1
Carbonell and Goldstein (1998) Jaime Carbonell and Jade Goldstein. 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In ACM SIGIR.
Chen and Zitnick (2014) Xinlei Chen and C Lawrence Zitnick. 2014. Learning a Recurrent Visual Representation for Image Caption Generation. Proceedings of CoRR (2014).
Das et al. (2013) Pradipto Das, Chenliang Xu, Richard F. Doell, and Jason J. Corso. 2013. A thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching. CVPR (2013).
de Avila et al. (2011) Sandra E. F. de Avila, Ana P. B. Lopes, A. da Luz, and A. de Albuquerque Araújo. 2011. VSUMM: A mechanism designed to produce static video summaries and a novel evaluation method. Pattern Recognition Letters 32, 1 (2011).