All in One: A Unified Synthetic Data Pipeline for Multimodal Video Understanding
Tanzila Rahman, Renjie Liao, Leonid Sigal

TL;DR
This paper introduces a unified synthetic data pipeline for multimodal video understanding that enables scalable data generation across multiple tasks and improves model reasoning. Models trained on the synthetic data generalize to real datasets and often outperform models trained with traditional real-data methods.
Contribution
The authors develop a versatile synthetic data generation framework supporting multiple tasks and introduce a VQA-based fine-tuning strategy to enhance reasoning in multimodal video models.
Findings
Models trained on synthetic data generalize well to real-world datasets.
Synthetic data training often outperforms traditional real-data training.
Unified pipeline supports diverse multimodal video tasks effectively.
Abstract
Training multimodal large language models (MLLMs) for video understanding requires large-scale annotated data spanning diverse tasks such as object counting, question answering, and segmentation. However, collecting and annotating multimodal video data in the real world is costly, slow, and inherently limited in diversity and coverage. To address this challenge, we propose a unified synthetic data generation pipeline capable of automatically producing unlimited multimodal video data with rich and diverse supervision. Our framework supports multiple task formats within a single pipeline, enabling scalable and consistent data creation across tasks. To further enhance reasoning ability, we introduce a VQA-based fine-tuning strategy that trains models to answer structured questions about visual content rather than relying solely on captions or simple instructions. This formulation encourages…
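The idea of deriving several kinds of supervision from one synthetic scene can be illustrated with a toy sketch. This is not the authors' pipeline: the scene representation, the box-shaped objects, and every name below (`SceneObject`, `make_scene`, `render_masks`, `make_annotations`) are hypothetical and chosen only for illustration. The point is that when the generator knows the ground-truth scene, counting labels, VQA pairs, and segmentation masks all come from the same specification and stay consistent with one another.

```python
import random
from dataclasses import dataclass


@dataclass
class SceneObject:
    category: str  # hypothetical object class, e.g. "cube"
    x: int         # left column of the object's box in frame 0
    y: int         # top row of the object's box
    size: int      # side length of the box in pixels
    vx: int        # horizontal velocity in pixels per frame


def make_scene(rng, num_objects=4, width=32, height=32):
    """Sample a toy scene: a list of square objects with simple motion."""
    categories = ["cube", "sphere", "cone"]
    objects = []
    for _ in range(num_objects):
        size = rng.randint(3, 6)
        objects.append(SceneObject(
            category=rng.choice(categories),
            x=rng.randint(0, width - size),
            y=rng.randint(0, height - size),
            size=size,
            vx=rng.choice([-1, 0, 1]),
        ))
    return objects


def render_masks(objects, num_frames=4, width=32, height=32):
    """Return masks[frame][row][col]: 0 for background, object index + 1 otherwise."""
    masks = []
    for t in range(num_frames):
        frame = [[0] * width for _ in range(height)]
        for idx, obj in enumerate(objects):
            # Clamp the moving box so it always stays inside the frame.
            x0 = min(max(obj.x + obj.vx * t, 0), width - obj.size)
            for r in range(obj.y, obj.y + obj.size):
                for c in range(x0, x0 + obj.size):
                    frame[r][c] = idx + 1  # later objects occlude earlier ones
        masks.append(frame)
    return masks


def make_annotations(objects):
    """Derive counting labels and VQA pairs from the same scene specification."""
    counts = {}
    for obj in objects:
        counts[obj.category] = counts.get(obj.category, 0) + 1
    vqa = [
        {"question": f"How many {cat}s are in the video?", "answer": str(n)}
        for cat, n in sorted(counts.items())
    ]
    vqa.append({"question": "How many objects are in the video?",
                "answer": str(len(objects))})
    return {"counts": counts, "vqa": vqa}


if __name__ == "__main__":
    rng = random.Random(0)
    scene = make_scene(rng)
    masks = render_masks(scene)
    annotations = make_annotations(scene)
    for pair in annotations["vqa"]:
        print(pair["question"], "->", pair["answer"])
```

Because every annotation is computed from the scene description rather than labeled by hand, adding a new task format in this sketch amounts to adding another function over the same `SceneObject` list.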
