TL;DR
This paper introduces the task of extracting knowledge graphs directly from videos, so that annotations are easier to process automatically and to evaluate quantitatively than natural language captions, and proposes a method to generate datasets for this task.
Contribution
It defines the new task of knowledge graph extraction from videos, creates datasets for it, and presents an initial deep-learning model for this purpose.
Findings
Successful dataset generation from existing video annotations
Initial model demonstrates feasibility of knowledge graph extraction
Results on MSVD* and MSR-VTT* datasets show promising performance
Abstract
Nearly all existing techniques for automated video annotation (or captioning) describe videos using natural language sentences. However, this has several shortcomings: (i) it is very hard to make further use of the generated natural language annotations in automated data processing, (ii) generating natural language annotations requires solving the hard subtask of generating semantically precise and syntactically correct natural language sentences, which is actually unrelated to the task of video annotation, (iii) it is difficult to quantitatively measure performance, as standard metrics (e.g., accuracy and F1-score) are inapplicable, and (iv) annotations are language-specific. In this paper, we propose the new task of knowledge graph extraction from videos, i.e., producing a description in the form of a knowledge graph of the contents of a given video. Since no datasets exist for this…
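To illustrate point (iii), here is a minimal sketch of how a standard metric such as F1 could apply once a video description is a knowledge graph rather than a sentence. It assumes the graph is a set of (subject, predicate, object) triples compared by exact match. The triple encoding, the function name `triple_f1`, and the example facts are all hypothetical and are not taken from the paper, whose actual graph format and evaluation protocol may differ.

```python
from typing import Set, Tuple

# Assumption: a knowledge graph is a set of (subject, predicate, object) triples.
Triple = Tuple[str, str, str]


def triple_f1(predicted: Set[Triple], reference: Set[Triple]) -> float:
    """F1-score over exactly matching triples (hypothetical scoring scheme)."""
    true_positives = len(predicted & reference)
    if true_positives == 0:
        # Also covers the cases where either set is empty.
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Made-up facts for a single video, for illustration only.
reference = {
    ("dog", "chases", "ball"),
    ("dog", "is_a", "animal"),
    ("ball", "is", "red"),
}
predicted = {
    ("dog", "chases", "ball"),
    ("dog", "is_a", "animal"),
    ("cat", "sits_on", "mat"),
}

score = triple_f1(predicted, reference)
print(f"F1 = {score:.3f}")
```

Because the comparison is between sets of discrete facts, no sentence-level similarity measure is needed, and the same triples can be consumed directly by downstream processing regardless of language.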
