Large Scale Holistic Video Understanding

Ali Diba; Mohsen Fayyaz; Vivek Sharma; Manohar Paluri; Jürgen Gall; Rainer Stiefelhagen; Luc Van Gool

arXiv:1904.11451·cs.CV·December 16, 2020

TL;DR

This paper introduces HVU, a large-scale dataset for holistic video understanding, and proposes HATNet, a neural network architecture that fuses appearance and temporal features for multi-label, multi-task video analysis.

Contribution

The paper presents HVU, a comprehensive dataset with 572k videos and 9 million annotations across multiple semantic categories, and introduces HATNet, a novel neural network architecture for holistic video understanding.

Findings

01

HVU enables multi-task, multi-label video analysis across diverse semantic categories.

02

HATNet effectively fuses 2D and 3D features for improved video classification.

03

Holistic representations improve performance in video classification, captioning, and clustering.

Abstract

Video recognition has been advanced in recent years by benchmarks with rich annotations. However, research is still mainly limited to human action or sports recognition, focusing on a highly specific video understanding task and thus leaving a significant gap towards describing the overall content of a video. We fill this gap by presenting a large-scale "Holistic Video Understanding Dataset" (HVU). HVU is organized hierarchically in a semantic taxonomy that focuses on multi-label and multi-task video understanding as a comprehensive problem that encompasses the recognition of multiple semantic aspects in the dynamic scene. HVU contains approx. 572k videos in total with 9 million annotations for training, validation, and test set spanning over 3142 labels. HVU encompasses semantic aspects defined on categories of scenes, objects, actions, events, attributes, and concepts which naturally…

Tables

Table 1: Statistics of machine-generated tags of the HVU training set for different categories. The category with the highest number of labels and annotations is the object category.

Task Category	Scene	Object	Action	Event	Attribute	Concept	Total
#Labels	419	2651	877	149	160	122	4378
#Annotations	1,485,154	5,944,277	1,552,920	918,696	1,036,308	965,077	11,902,432
#Videos	366,941	480,821	481,418	320,428	368,668	375,664	481,418

Table 2: Performance comparison between machine-generated and human-verified tags of HVU. This evaluation shows that the human annotation process is crucial for obtaining a more effective dataset. The CNN model used for this experiment is 3D-ResNet18.

Dataset	Scene	Object	Action	Event	Attribute	Concept	HVU Overall (%)
Machine-Generated HVU	46.3	22.4	43.8	31.4	25.3	20.1	31.6
Human-Annotation HVU	50.1	27.9	46.7	35.7	29.2	23.2	35.4

Table 3: Evaluation of training Kinetics with HVU labels.

Training Labels	Action Recognition Performance
Action	65.6
Action + HVU	68.8

Code & Models

Repositories

holistic-video-understanding/HVU-Dataset
Official

Full text

1KU Leuven, 2University of Bonn, 3KIT, Karlsruhe, 4ETH Zürich, 5Sensifai

{firstname.lastname}@kuleuven.be, {lastname}@iai.uni-bonn.de, {firstname.lastname}@kit.edu, [email protected]

Supplementary Material:

Large Scale Holistic Video Understanding

Ali Diba1,5⋆

Mohsen Fayyaz2,⋆

Vivek Sharma3,⋆

Manohar Paluri

Jürgen Gall2

Rainer Stiefelhagen3

Luc Van Gool1,4,5

Appendix: This document provides supplementary material as mentioned in the main paper.

Appendix 0.A HVU Dataset

⋆ Ali Diba, Mohsen Fayyaz and Vivek Sharma contributed equally to this work and are listed in alphabetical order.

0.A.1 Human Annotation Details

The raw machine-generated annotations consist of almost 8K labels. The initial stage of human verification on the validation set resulted in 4378 labels, and the final stage of the complete human verification/modification process ended with 3142 labels. During the human annotation process, 80 new labels were added by the annotators.

Specifically, for the HVU human verification task, we employed three different teams (Team-A, Team-B and Team-C) of 55 human annotators. Team-A works on the taxonomy of the dataset, building it from the visual meaning and definition of the tags obtained from the API predictions. Team-B and Team-C are the verification teams and perform four tasks: (a) verify the tags of each video by watching it and flag false tags; (b) review each tag by watching its videos and flag wrong videos; (c) add tags to videos where tags are missing; and (d) suggest modifications to tags, such as renaming or merging.

To make sure both Team-B and Team-C have a clear understanding of the tags and the corresponding videos, we ask them to use the tag definitions provided by Team-A. For the four tasks above, Team-B goes through all the videos and provides the first round of clean annotations. Team-C then reviews the annotations from Team-B to guarantee a more accurate and cleaner version. Finally, Team-A reviews the suggestions from tasks (c) and (d) and applies them to the dataset. The verification process takes $\sim$100 seconds on average per video clip for a trained worker. It took about 8500 person-hours, first to clean the machine-generated tags and remove errors, and second to add any missing labels from the dictionary. By combining machine-generated tags with human annotation, the HVU dataset covers a diverse set of tags with clean annotations. Using machine-generated tags in the first step helps us cover a larger number of tags than a human could remember and label in a reasonable time.

To make sure that we have a balanced distribution of samples per tag, we require a minimum of 50 samples for each tag.
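The minimum-sample rule can be illustrated with a short sketch. It assumes that tags falling below the threshold are simply dropped, which the text does not state explicitly, and it uses a hypothetical in-memory layout (`video_tags`, a mapping from video id to a set of tags) rather than the actual HVU annotation files.

```python
from collections import Counter

MIN_SAMPLES_PER_TAG = 50  # threshold stated in the text


def filter_rare_tags(video_tags, min_samples=MIN_SAMPLES_PER_TAG):
    """Drop tags that are attached to fewer than `min_samples` videos.

    `video_tags` maps a video id to the set of tags assigned to it.
    This layout is hypothetical, not the HVU file format.
    """
    # Count how many videos carry each tag.
    counts = Counter(tag for tags in video_tags.values() for tag in tags)
    kept = {tag for tag, n in counts.items() if n >= min_samples}
    # Keep every video, but only with the tags that survived the threshold.
    return {vid: tags & kept for vid, tags in video_tags.items()}


# Toy data: "dog" is on 60 videos, "unicycle" on only 3 of them.
toy = {f"v{i}": {"dog"} for i in range(60)}
for i in range(3):
    toy[f"v{i}"] = {"dog", "unicycle"}

filtered = filter_rare_tags(toy)
```

In this toy example the rare tag disappears from every video while the videos themselves are retained. Whether a video left with no tags should also be removed is a separate design choice that the sketch leaves open.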

To provide more details on the HVU human annotation process, we report statistics for its different stages. Table 1 shows the statistics of the machine-generated annotations of the training set. Note that the labels and categories are the result of the initial human annotation process over the validation set of the dataset. The category with the highest number of labels and annotations is the object category; concept is the category with the lowest number of labels. To give a better understanding of the annotation statistics, Figure 1 depicts the distribution of categories with respect to the number of annotations, labels, and annotations per label. We can observe that the object category has the largest share of labels and annotations, which is due to the abundance of objects in video. Despite this, the object category does not have the highest annotations-per-label ratio. Figure 2 shows the percentage of the different subsets of the main categories. There are 50 different sets of videos based on the assigned semantic categories. About $36\%$ of the videos have all of the categories.
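The annotations-per-label observation can be checked directly from the numbers in Table 1. The following is a small sketch that only re-derives ratios from the table; the variable names are illustrative.

```python
# Values copied from Table 1 (machine-generated tags, HVU training set).
labels = {
    "Scene": 419, "Object": 2651, "Action": 877,
    "Event": 149, "Attribute": 160, "Concept": 122,
}
annotations = {
    "Scene": 1_485_154, "Object": 5_944_277, "Action": 1_552_920,
    "Event": 918_696, "Attribute": 1_036_308, "Concept": 965_077,
}

# Average number of annotations carried by each label of a category.
per_label = {cat: annotations[cat] / labels[cat] for cat in labels}

# Categories ordered from the highest to the lowest ratio.
ranked = sorted(per_label, key=per_label.get, reverse=True)

for cat in ranked:
    print(f"{cat:<10}{per_label[cat]:>10.0f}")
```

With these values the concept category has the highest ratio (roughly 7,900 annotations per label), while the object category sits near 2,200, which is consistent with the statement that the category with the most labels and annotations does not have the highest annotations-per-label ratio.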

0.A.2 Effect of Human Annotation

To show the impact of the human annotation process, we evaluated both versions of HVU, one with machine-generated tags and one with human-annotated tags. We trained two 3D-ResNet18 models, one on each version, and the comparison is reported in Table 2.
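To make the comparison in Table 2 easier to read, the per-category gain from human annotation can be computed as a simple difference. This sketch only restates the table values; it makes no assumption about how the overall score is aggregated.

```python
# Scores copied from Table 2 (3D-ResNet18, values in percent).
machine = {
    "Scene": 46.3, "Object": 22.4, "Action": 43.8, "Event": 31.4,
    "Attribute": 25.3, "Concept": 20.1, "Overall": 31.6,
}
human = {
    "Scene": 50.1, "Object": 27.9, "Action": 46.7, "Event": 35.7,
    "Attribute": 29.2, "Concept": 23.2, "Overall": 35.4,
}

# Absolute gain in percentage points, rounded to the table's precision.
gain = {cat: round(human[cat] - machine[cat], 1) for cat in machine}

# Category that benefits most from human verification.
largest = max((c for c in gain if c != "Overall"), key=gain.get)

for cat, delta in gain.items():
    print(f"{cat:<10}+{delta:.1f}")
```

Every category improves, with the object category gaining the most (5.5 points) and the overall score rising by 3.8 points.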

0.A.3 HVU Samples

We present some samples of videos and their corresponding tags in Figures 3 and 4.

0.A.4 Effect of Additional Categories on Kinetics

One of the arguments in our paper is that more semantic categories, such as object and scene, can lead to learning more effective video representations. We have shown results on the HVU dataset in the paper. Here, we provide a similar experiment for Kinetics-600 as a subset of HVU. We compare the performance of a 3D-ResNet18 trained on Kinetics videos with only their action labels against one trained on the full HVU labels for the same videos. For evaluation, we measure performance on the Kinetics action labels. Table 3 shows that adding more semantic labels when training on Kinetics improves action classification performance. We attribute this to the additional HVU labels helping deep models learn new visual features for understanding videos.
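One common way to train a single network on action labels together with additional tags is to add a multi-label loss term to the usual action classification loss. The sketch below illustrates that idea in plain Python. It is an assumption made for illustration, not the authors' training code: the function names, the choice of softmax cross-entropy for the action head and sigmoid cross-entropy for the extra tags, and the `tag_weight` parameter are all hypothetical.

```python
import math


def softmax_cross_entropy(logits, target_index):
    """Cross-entropy for a single-label prediction (e.g. one action class)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]


def binary_cross_entropy(logits, targets):
    """Mean sigmoid cross-entropy over independent multi-label targets."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Stable form of -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(z).
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def joint_loss(action_logits, action_target, tag_logits, tag_targets,
               tag_weight=1.0):
    """Action loss plus a weighted multi-label loss on the extra tags.

    `tag_weight` is a hypothetical knob; 0.0 recovers action-only training.
    """
    action_term = softmax_cross_entropy(action_logits, action_target)
    tag_term = binary_cross_entropy(tag_logits, tag_targets)
    return action_term + tag_weight * tag_term


# One toy clip: 3 action classes, 4 extra tags (e.g. scene/object tags).
loss = joint_loss([2.0, 0.5, -1.0], 0, [1.5, -0.3, 0.0, 2.2], [1, 0, 0, 1])
```

Setting `tag_weight=0.0` corresponds to the "Action" row of Table 3 and a positive weight to the "Action + HVU" row, under the assumptions stated above.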