TL;DR
This paper introduces USV, a large dataset of user-generated short videos, along with new tasks and baseline methods for high-level semantic understanding, aiming to advance research beyond instance recognition.
Contribution
The paper presents the USV dataset, defines new high-level semantic tasks, and proposes baseline models for topic recognition and video-text retrieval.
Findings
USV contains approximately 224,000 videos from user-generated content platforms.
Two baseline models, MMF-Net and VTCL, are proposed for the new tasks of topic recognition and video-text retrieval.
Comprehensive benchmarks are provided to guide future research.
Abstract
Several large-scale video datasets have been published in recent years and have advanced the field of video understanding. However, newly emerged user-generated short-form videos have rarely been studied. This paper presents USV, the User-generated Short-form Video dataset for high-level semantic video understanding. The dataset contains around 224K videos collected from UGC platforms by label queries, without extra manual verification or trimming. Although video understanding has improved notably in recent years, most works focus on instance-level recognition, which is not sufficient for learning representations of the high-level semantic information in videos. Therefore, we further establish two tasks on USV: topic recognition and video-text retrieval. We propose two unified and effective baseline methods, Multi-Modality Fusion Network (MMF-Net) and Video-Text Contrastive…
