VLG: General Video Recognition with Web Textual Knowledge
Jintao Lin, Zhaoyang Liu, Wenhai Wang, Wayne Wu, Limin Wang

TL;DR
This paper introduces VLG, a unified visual-linguistic framework leveraging web textual knowledge for general video recognition across diverse challenging settings, and establishes a comprehensive benchmark for this task.
Contribution
It proposes a novel two-stage training paradigm for general video recognition (GVR) that uses external web text, and it builds a new benchmark, Kinetics-GVR, whose four sub-task datasets cover multiple recognition scenarios.
Findings
VLG achieves state-of-the-art results across all tested settings.
The framework demonstrates strong generalization and effectiveness.
The benchmark facilitates future research in general video recognition.
Abstract
Video recognition in an open and dynamic world is quite challenging, as we need to handle different settings such as close-set, long-tail, few-shot and open-set. By leveraging semantic knowledge from noisy text descriptions crawled from the Internet, we focus on the general video recognition (GVR) problem of solving different recognition tasks within a unified framework. The core contribution of this paper is twofold. First, we build a comprehensive video recognition benchmark of Kinetics-GVR, including four sub-task datasets to cover the mentioned settings. To facilitate the research of GVR, we propose to utilize external textual knowledge from the Internet and provide multi-source text descriptions for all action classes. Second, inspired by the flexibility of language representation, we present a unified visual-linguistic framework (VLG) to solve the problem of GVR by an effective…
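The abstract is cut off before it describes how VLG works, so the following is only a minimal sketch of one generic way a visual-linguistic classifier can use class text descriptions. It scores a video embedding against one text embedding per class by cosine similarity. It is not the authors' method. The function names, the toy three-dimensional embeddings, and the `reject_below` threshold (a simple stand-in for the open-set case) are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_scores(video_feat, class_text_feats):
    """Cosine similarity between one video embedding and each class text embedding."""
    v = l2_normalize(video_feat)
    scores = {}
    for name, text_feat in class_text_feats.items():
        t = l2_normalize(text_feat)
        scores[name] = sum(a * b for a, b in zip(v, t))
    return scores


def classify(video_feat, class_text_feats, reject_below=None):
    """Pick the best-matching class; optionally reject low scores as 'unknown'.

    reject_below is a hypothetical threshold, used here only to illustrate
    how an open-set decision could be layered on top of similarity scores.
    """
    scores = cosine_scores(video_feat, class_text_feats)
    best = max(scores, key=scores.get)
    if reject_below is not None and scores[best] < reject_below:
        return "unknown", scores[best]
    return best, scores[best]


# Toy embeddings (made up); a real system would obtain these from
# trained video and text encoders.
text_feats = {
    "playing guitar": [1.0, 0.0, 0.0],
    "surfing": [0.0, 1.0, 0.0],
}

label, score = classify([0.9, 0.1, 0.0], text_feats)
open_label, open_score = classify([0.0, 0.0, 1.0], text_feats, reject_below=0.5)
```

Because classes are represented by text embeddings rather than by fixed classifier weights, a new class can in principle be added by supplying a new description, which is one reason language representations are attractive for few-shot and open-set settings.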
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
