Sparrow: Data-Efficient Video-LLM with Text-to-Image Augmentation
Shukang Yin, Chaoyou Fu, Sirui Zhao, Chunjiang Ge, Yan Yang, Yuhan Dai, Yongdong Luo, Tong Xu, Caifeng Shan, Enhong Chen

TL;DR
This paper introduces Sparrow, a data augmentation technique that synthesizes video-like samples from text instructions to improve data efficiency in training video-LLMs, achieving comparable or superior performance with less data.
Contribution
The paper proposes Sparrow, a novel synthetic data augmentation method that enhances data efficiency in training video-LLMs by generating video-like samples from text instructions.
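The summary above does not spell out how text instructions become video-like samples, so the following is only a minimal sketch of one plausible reading: a long text context is split into segments, and each segment stands in for one "frame" of a multi-frame sample. The names `Frame`, `split_into_segments`, and `text_to_video_like_sample`, the word-based splitting rule, and the default segment length are all hypothetical and not taken from the paper. A real pipeline would presumably rasterize each segment into an image, a step this sketch leaves out.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Frame:
    """Hypothetical stand-in for one rendered image; holds the segment text only."""
    index: int
    text: str


def split_into_segments(context: str, words_per_segment: int) -> List[str]:
    """Split a text context into consecutive segments of at most N words."""
    if words_per_segment <= 0:
        raise ValueError("words_per_segment must be positive")
    words = context.split()
    return [
        " ".join(words[i:i + words_per_segment])
        for i in range(0, len(words), words_per_segment)
    ]


def text_to_video_like_sample(
    context: str,
    question: str,
    answer: str,
    words_per_segment: int = 50,  # assumed value, chosen for illustration
) -> Dict[str, object]:
    """Build a multi-frame sample from a text-only instruction example.

    Assumption: each text segment plays the role of one video frame, and
    the original question/answer pair is kept unchanged as supervision.
    """
    segments = split_into_segments(context, words_per_segment)
    frames = [Frame(index=i, text=seg) for i, seg in enumerate(segments)]
    return {"frames": frames, "question": question, "answer": answer}
```

Under these assumptions, a longer context simply yields more frames, which is one way a text-only corpus could mimic long video inputs without any real long-video data.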
Findings
Synthetic video-like samples improve training efficiency.
Models trained with Sparrow reach comparable or better performance with less training data.
The method improves long-video understanding without training on long-video data.
Abstract
Recent years have seen the success of Multimodal Large Language Models (MLLMs) in the domain of vision understanding. The success of these models can largely be attributed to the dominant scaling law, which states that larger parameter sizes and data volumes contribute to better performance. Notably, data scaling has been primarily driven by automatic data pipelines, which focus on the self-instruction of LLMs. The paradigm has been taken for granted for quite some time, but the study of the effectiveness of scaling with these data has been neglected for a long time. In this context, this work revisits scaling with synthetic data and focuses on developing video-LLMs from a data-centric perspective. Our primary study approach involves fine-tuning pre-trained image-LLMs with video data and examining learning efficiency through data scaling. Results from our preliminary experiments reveal…
Taxonomy
Topics: Translation Studies and Practices · Natural Language Processing Techniques
Methods: Softmax · Attention Is All You Need
