TL;DR
This paper introduces HVU, a large-scale dataset for holistic video understanding, and proposes HATNet, a neural network architecture that fuses appearance and temporal features for multi-label, multi-task video analysis.
Contribution
The paper presents HVU, a comprehensive dataset with 572k videos and 9 million annotations across multiple semantic categories, and introduces HATNet, a novel neural network architecture for holistic video understanding.
Findings
HVU enables multi-task, multi-label video analysis across diverse semantic categories.
HATNet effectively fuses 2D and 3D features for improved video classification.
Holistic representations improve performance in video classification, captioning, and clustering.
Abstract
Video recognition has been advanced in recent years by benchmarks with rich annotations. However, research is still mainly limited to human action or sports recognition - focusing on a highly specific video understanding task and thus leaving a significant gap towards describing the overall content of a video. We fill this gap by presenting a large-scale "Holistic Video Understanding Dataset" (HVU). HVU is organized hierarchically in a semantic taxonomy that focuses on multi-label and multi-task video understanding as a comprehensive problem that encompasses the recognition of multiple semantic aspects in the dynamic scene. HVU contains approx. 572k videos in total with 9 million annotations for training, validation, and test set spanning over 3142 labels. HVU encompasses semantic aspects defined on categories of scenes, objects, actions, events, attributes, and concepts which naturally…

Table 1: Statistics of the machine-generated annotations of the training set.

| Task Category | Scene | Object | Action | Event | Attribute | Concept | Total |
|---|---|---|---|---|---|---|---|
| #Labels | 419 | 2651 | 877 | 149 | 160 | 122 | 4378 |
| #Annotations | 1,485,154 | 5,944,277 | 1,552,920 | 918,696 | 1,036,308 | 965,077 | 11,902,432 |
| #Videos | 366,941 | 480,821 | 481,418 | 320,428 | 368,668 | 375,664 | 481,418 |
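
Table 2: Comparison of HVU with machine-generated tags and with human-annotated tags (3D-ResNet18).

| Dataset | Scene | Object | Action | Event | Attribute | Concept | HVU Overall |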
|---|---|---|---|---|---|---|---|
| Machine-Generated HVU | 46.3 | 22.4 | 43.8 | 31.4 | 25.3 | 20.1 | 31.6 |
| Human-Annotation HVU | 50.1 | 27.9 | 46.7 | 35.7 | 29.2 | 23.2 | 35.4 |
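
Table 3: Action recognition performance on Kinetics-600 for a 3D-ResNet18 trained with action labels only versus full HVU labels.

| Training Labels | Action Recognition Performance |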
|---|---|
| Action | 65.6 |
| Action + HVU | 68.8 |
1KU Leuven, 2University of Bonn, 3KIT, Karlsruhe, 4ETH Zürich, 5Sensifai
{firstname.lastname}@kuleuven.be, {lastname}@iai.uni-bonn.de, {firstname.lastname}@kit.edu, [email protected]
Supplementary Material:
Large Scale Holistic Video Understanding
Ali Diba1,5⋆
Mohsen Fayyaz2,⋆
Vivek Sharma3,⋆
Manohar Paluri
Jürgen Gall2
Rainer Stiefelhagen3
Luc Van Gool1,4,5
Appendix: This document provides supplementary material as mentioned in the main paper.
Appendix A HVU Dataset
⋆ Ali Diba, Mohsen Fayyaz and Vivek Sharma contributed equally to this work and are listed in alphabetical order.
A.1 Human Annotation Details
The raw machine-generated annotations consist of almost 8K labels. The initial stage of human verification on the validation set resulted in 4378 labels, and the final stage of the complete human verification/modification process ended with 3142 labels. During the human annotation process, 80 new labels were added by the annotators.
Specifically, for the HVU human verification task we employed three different teams (Team-A, Team-B and Team-C) of 55 human annotators. Team-A works on the taxonomy of the dataset, building it from the visual meaning and definition of the tags obtained from the APIs' predictions. Team-B and Team-C are the verification teams and perform four tasks: (a) verify the tags of each video by watching it and flag false tags; (b) review each tag by watching its videos and flag wrong videos; (c) add tags to videos where tags are missing; and (d) suggest modifications to tags, such as renaming or merging.
To make sure both Team-B and Team-C have a clear understanding of the tags and the corresponding videos, we ask them to use the tag definitions provided by Team-A. For the four tasks above, Team-B goes through all the videos and provides the first round of clean annotations. Team-C then reviews the annotations from Team-B to guarantee a more accurate and cleaner version. Finally, Team-A reviews the suggestions from tasks (c) and (d) and applies them to the dataset. The verification process takes 100 seconds on average per video clip for a trained worker. It took about 8500 person-hours to first clean the machine-generated tags and remove errors, and then add any missing labels from the dictionary. By combining machine-generated tags with human annotation, the HVU dataset covers a diverse set of tags with clean annotations. Using machine-generated tags in the first step lets us cover a larger number of tags than a human can remember and label in a reasonable time.
To make sure that we have a balanced distribution of samples per tag, we require a minimum of 50 samples per tag.
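The minimum-sample rule can be sketched as a simple frequency filter. The snippet below is a minimal illustration, not the authors' tooling: the `video_tags` mapping, the function name `filter_rare_tags`, and the toy data are hypothetical, and only the threshold of 50 comes from the text.

```python
from collections import Counter

MIN_SAMPLES_PER_TAG = 50  # threshold stated in the text


def filter_rare_tags(video_tags, min_samples=MIN_SAMPLES_PER_TAG):
    """Drop tags that occur in fewer than `min_samples` videos.

    `video_tags` maps a video id to the set of tags assigned to it
    (a hypothetical in-memory format, assumed for illustration).
    """
    # Count in how many videos each tag appears.
    counts = Counter(tag for tags in video_tags.values() for tag in set(tags))
    kept = {tag for tag, n in counts.items() if n >= min_samples}
    # Keep only the surviving tags on every video.
    return {vid: set(tags) & kept for vid, tags in video_tags.items()}


# Toy data: "dog" appears in 60 videos, "unicycle" in only 3.
toy = {f"v{i}": {"dog"} for i in range(60)}
for i in range(3):
    toy[f"v{i}"].add("unicycle")

filtered = filter_rare_tags(toy)
```

In this toy run the tag with 3 samples is removed from every video, while the tag with 60 samples is kept.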
To provide more details on the HVU human annotation process, we report statistics for its different stages. Table 1 shows the statistics of the machine-generated annotations of the training set. Note that the labels and categories are the result of the initial human annotation process over the validation set. The category with the highest number of labels and annotations is the object category; concept is the category with the lowest number of labels. To give a better understanding of these statistics, Figure 1 depicts the distribution of categories with respect to the number of annotations, labels, and annotations per label. The object category has the largest share of labels and annotations, which is due to the abundance of objects in video. Despite this, the object category does not have the highest annotations-per-label ratio. Figure 2 shows the percentage of the different subsets of the main categories. There are 50 different sets of videos based on assigned semantic categories. About of the videos have all of the categories.
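The annotations-per-label observation can be checked directly from the numbers in Table 1. The following is a small sketch: the counts are copied from the table, while the variable names are chosen here for illustration.

```python
# Per-category counts copied from Table 1 (machine-generated training annotations).
labels = {"Scene": 419, "Object": 2651, "Action": 877,
          "Event": 149, "Attribute": 160, "Concept": 122}
annotations = {"Scene": 1_485_154, "Object": 5_944_277, "Action": 1_552_920,
               "Event": 918_696, "Attribute": 1_036_308, "Concept": 965_077}

# Average number of annotations attached to each label of a category.
per_label = {c: annotations[c] / labels[c] for c in labels}

most_labels = max(labels, key=labels.get)
most_annotations = max(annotations, key=annotations.get)
densest = max(per_label, key=per_label.get)

# Print categories from the highest to the lowest ratio.
for c in sorted(per_label, key=per_label.get, reverse=True):
    print(f"{c:<10} {per_label[c]:8.1f} annotations per label")
```

With the Table 1 figures, the object category leads in both raw counts, while the ratio is highest for the concept category, which has the fewest labels.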
A.2 Effect of Human Annotation
To assess the impact of the human annotation process, we evaluated both versions of HVU: the one with machine-generated tags and the one with human-annotated tags. We trained two 3D-ResNet18 models, one on each version, and report the comparison in Table 2.
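To make the comparison easier to read, the per-category difference between the two rows of Table 2 can be computed as below. This is a small sketch over the reported scores only; the variable names are chosen here for illustration.

```python
# Scores copied from Table 2, in the table's column order.
categories = ["Scene", "Object", "Action", "Event", "Attribute", "Concept"]
machine = [46.3, 22.4, 43.8, 31.4, 25.3, 20.1]  # machine-generated HVU
human = [50.1, 27.9, 46.7, 35.7, 29.2, 23.2]    # human-annotation HVU

# Gain in points per category, rounded to the table's precision.
gains = {c: round(h - m, 1) for c, m, h in zip(categories, machine, human)}
largest = max(gains, key=gains.get)

for c, g in gains.items():
    print(f"{c:<10} +{g:.1f}")
```

Every category improves with human annotation, and the object category shows the largest gain in this table.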
A.3 HVU Samples
We present some sample videos and their corresponding tags in Figure 3 and Figure 4.
A.4 Effect of Additional Categories on Kinetics
One argument of our paper is that additional semantic categories such as object and scene lead to learning more effective video representations. We have shown results on the HVU dataset in the paper. Here, we provide a similar experiment on Kinetics-600, which is a subset of HVU. We compare the performance of a 3D-ResNet18 trained on Kinetics videos with only their action labels against the same network trained on the full HVU labels for the same videos. For evaluation, we measure performance on the Kinetics action labels. Table 3 shows that adding more semantic labels during training on Kinetics improves action classification performance from 65.6 to 68.8. We attribute this to the additional HVU labels helping the model learn visual features that are useful for understanding videos.
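To make the two training setups concrete, the sketch below builds the target vector for one video under each setup. It is an illustration only: the tiny vocabularies, the example tags, and the helper `multi_hot` are hypothetical, and the actual data loading and loss are not described in this section.

```python
def multi_hot(tags, vocabulary):
    """Return a 0/1 vector with a 1 for every vocabulary entry present in `tags`."""
    present = set(tags)
    return [1 if label in present else 0 for label in vocabulary]


# Hypothetical vocabularies: action labels only vs. action plus other categories.
action_vocab = ["playing_guitar", "surfing"]
hvu_vocab = action_vocab + ["beach", "guitar", "concert", "outdoor"]

# Hypothetical tags for one video.
video_tags = ["playing_guitar", "guitar", "concert"]

action_only_target = multi_hot(video_tags, action_vocab)  # setup 1: action labels
full_hvu_target = multi_hot(video_tags, hvu_vocab)        # setup 2: all categories
```

In both setups the evaluation uses only the action entries, which is why Table 3 reports a single action recognition score per row.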
