TL;DR
VideoLT is a large-scale, long-tailed video recognition dataset. The paper shows that state-of-the-art methods for long-tailed image recognition do not carry over well to video, and proposes FrameStack, a frame-level sampling technique that improves long-tailed video recognition.
Contribution
The paper presents VideoLT, a new long-tailed video dataset, and proposes FrameStack, a novel frame sampling method tailored for long-tailed video recognition.
Findings
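State-of-the-art long-tailed image recognition methods perform poorly on video, which the authors attribute to the additional temporal dimension.
FrameStack improves classification accuracy on long-tailed video data.
FrameStack samples at the frame level, rather than the video level, to balance the class distribution.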
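The abstract excerpt is cut off before it explains how FrameStack chooses its sampling ratio, so the following is only a rough illustration of frame-level class balancing, not the authors' algorithm. It assumes a hypothetical per-class score in [0, 1] (for example, a running accuracy estimate) and a hypothetical linear rule: classes the model already handles well contribute fewer frames per video, and weaker (typically tail) classes contribute more. The names `frames_to_sample`, `sample_frames`, `max_frames`, and `min_frames` are invented for this sketch.

```python
import random


def frames_to_sample(class_score, max_frames=60, min_frames=4):
    """Hypothetical rule: a higher class score means fewer frames are kept.

    class_score is an assumed per-class performance estimate in [0, 1].
    """
    score = min(max(class_score, 0.0), 1.0)  # clamp to [0, 1]
    return max(min_frames, round(max_frames * (1.0 - score)))


def sample_frames(video_frames, class_score, rng=random):
    """Keep a class-dependent number of frames, preserving temporal order."""
    k = min(len(video_frames), frames_to_sample(class_score))
    kept = sorted(rng.sample(range(len(video_frames)), k))
    return [video_frames[i] for i in kept]


if __name__ == "__main__":
    video = list(range(100))  # stand-in for 100 decoded frames
    head = sample_frames(video, class_score=0.9)  # well-learned class
    tail = sample_frames(video, class_score=0.1)  # poorly learned class
    print(len(head), len(tail))
```

Under these assumptions a video from a well-learned class contributes far fewer frames than one from a poorly learned class, which is one simple way to rebalance training data at the frame level instead of resampling whole videos.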
Abstract
Label distributions in the real world are often long-tailed and imbalanced, resulting in models biased toward dominant labels. While long-tailed recognition has been extensively studied for image classification tasks, limited effort has been made for the video domain. In this paper, we introduce VideoLT, a large-scale long-tailed video recognition dataset, as a step toward real-world video recognition. VideoLT contains 256,218 untrimmed videos, annotated into 1,004 classes with a long-tailed distribution. Through extensive studies, we demonstrate that state-of-the-art methods used for long-tailed image recognition do not perform well in the video domain due to the additional temporal dimension in video data. This motivates us to propose FrameStack, a simple yet effective method for the long-tailed video recognition task. In particular, FrameStack performs sampling at the frame-level in…
