All in One: A Unified Synthetic Data Pipeline for Multimodal Video Understanding

Tanzila Rahman; Renjie Liao; Leonid Sigal

arXiv:2604.12335·cs.CV·April 15, 2026

TL;DR

This paper introduces a unified synthetic data pipeline for multimodal video understanding that generates training data for multiple tasks at scale. Models trained on this data generalize to real datasets and often outperform models trained on real data.

Contribution

The authors develop a versatile synthetic data generation framework supporting multiple tasks and introduce a VQA-based fine-tuning strategy to enhance reasoning in multimodal video models.

Findings

01. Models trained on synthetic data generalize well to real-world datasets.

02. Synthetic data training often outperforms traditional real-data training.

03. Unified pipeline supports diverse multimodal video tasks effectively.

Abstract

Training multimodal large language models (MLLMs) for video understanding requires large-scale annotated data spanning diverse tasks such as object counting, question answering, and segmentation. However, collecting and annotating multimodal video data in the real world is costly, slow, and inherently limited in diversity and coverage. To address this challenge, we propose a unified synthetic data generation pipeline capable of automatically producing unlimited multimodal video data with rich and diverse supervision. Our framework supports multiple task formats within a single pipeline, enabling scalable and consistent data creation across tasks. To further enhance reasoning ability, we introduce a VQA-based fine-tuning strategy that trains models to answer structured questions about visual content rather than relying solely on captions or simple instructions. This formulation encourages…
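As an illustration of how one synthetic scene could yield supervision in several task formats, here is a minimal Python sketch. It is not the authors' pipeline: the scene record layout, the field names, and the functions `counting_sample`, `vqa_samples`, and `build_samples` are all hypothetical, chosen only to show that a generator which knows every object it placed can emit counting and question-answering samples from the same record. Segmentation is left out because it would need mask data that this toy record does not carry.

```python
from collections import Counter

# Hypothetical scene record (an assumption, not the paper's format).
# A synthetic generator knows every object it placed, so labels are exact.
scene = {
    "video_id": "synthetic_0001",
    "objects": [
        {"id": 0, "category": "car", "color": "red"},
        {"id": 1, "category": "car", "color": "blue"},
        {"id": 2, "category": "person", "color": "green"},
    ],
}


def counting_sample(scene, category):
    """Build one object-counting sample for the given category."""
    count = sum(1 for obj in scene["objects"] if obj["category"] == category)
    return {
        "video_id": scene["video_id"],
        "task": "counting",
        "question": f"How many objects of category '{category}' appear in the video?",
        "answer": str(count),
    }


def vqa_samples(scene):
    """Build structured attribute questions.

    A color question is only asked when the category occurs exactly once,
    so that the question has a single unambiguous answer.
    """
    per_category = Counter(obj["category"] for obj in scene["objects"])
    samples = []
    for obj in scene["objects"]:
        if per_category[obj["category"]] == 1:
            samples.append({
                "video_id": scene["video_id"],
                "task": "vqa",
                "question": f"What color is the {obj['category']}?",
                "answer": obj["color"],
            })
    return samples


def build_samples(scene):
    """Emit every supported task format from the same scene record."""
    categories = sorted({obj["category"] for obj in scene["objects"]})
    samples = [counting_sample(scene, category) for category in categories]
    samples.extend(vqa_samples(scene))
    return samples


if __name__ == "__main__":
    for sample in build_samples(scene):
        print(sample["task"], "|", sample["question"], "->", sample["answer"])
```

Because every sample is derived from the same record, the answers across task formats cannot disagree with each other, which is one plausible reading of the "consistent data creation across tasks" claim.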
