Video DataFlywheel: Resolving the Impossible Data Trinity in Video-Language Understanding
Xiao Wang, Jianlong Wu, Zijia Lin, Fuzheng Zhang, Di Zhang, and Liqiang Nie

TL;DR
This paper introduces the Video DataFlywheel framework, an iterative method that refines video annotations and controls noise to improve large-scale video-language understanding, addressing data scarcity and the trade-off among data quantity, diversity, and quality.
Contribution
It proposes a novel iterative refinement framework with AdaTaiLr noise control, enhancing dataset quality and scalability for video-language pre-training.
Findings
Achieves a 3% performance boost over baselines.
Improves dataset quality with minimal diversity loss.
Enhances video question answering and retrieval tasks.
Abstract
Recently, video-language understanding has achieved great success through large-scale pre-training. However, data scarcity remains a prevailing challenge. This study quantitatively reveals an "impossible trinity" among data quantity, diversity, and quality in pre-training datasets. Recent efforts seek to refine large-scale, diverse ASR datasets compromised by low quality through synthetic annotations. These methods successfully leverage useful information in multimodal video content (frames, tags, ASR transcripts, etc.) to refine the original annotations. Nevertheless, they struggle to mitigate noise within synthetic annotations and lack scalability as the dataset size expands. To address these issues, we introduce the Video DataFlywheel framework, which iteratively refines video annotations with improved noise control methods. For iterative refinement, we first leverage a…
Taxonomy
Topics: Video Analysis and Summarization
