Self-Contained Entity Discovery from Captioned Videos
Melika Ayoughi, Pascal Mettes, Paul Groth

TL;DR
This paper presents a novel method for discovering named entities in videos using only video content and captions, eliminating the need for external knowledge or annotations, and introduces new benchmarks for evaluation.
Contribution
The work proposes a three-stage, self-contained approach for entity discovery in videos from multimodal data, along with new benchmarks based on popular TV series.
Findings
Achieves entity recognition accuracy close to that of supervised methods.
Demonstrates effectiveness on new benchmarks derived from TV series.
Highlights challenges and future directions for self-contained visual entity discovery.
Abstract
This paper introduces the task of visual named entity discovery in videos without the need for task-specific supervision or task-specific external knowledge sources. Assigning specific names to entities (e.g. faces, scenes, or objects) in video frames is a long-standing challenge. Commonly, this problem is addressed as a supervised learning objective by manually annotating faces with entity labels. To bypass the annotation burden of this setup, several works have investigated the problem by utilizing external knowledge sources such as movie databases. While effective, such approaches do not work when task-specific knowledge sources are not provided and can only be applied to movies and TV series. In this work, we take the problem a step further and propose to discover entities in videos from videos and corresponding captions or subtitles. We introduce a three-stage method where we (i)…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
