DreamFrame: Enhancing Video Understanding via Automatically Generated QA and Style-Consistent Keyframes

Zhende Song; Chenchen Wang; Jiamu Sheng; Chi Zhang; Shengji Tang; Jiayuan Fan; Tao Chen

arXiv:2403.01422·cs.CV·August 12, 2025·2 cites



TL;DR

DreamFrame introduces an automated framework for generating style-consistent keyframes and QA pairs, creating a large dataset to improve video understanding in LVLMs and enabling effective downstream fine-tuning.

Contribution

The paper presents a three-stage method for automatically creating style-consistent datasets, supporting LVLM instruction tuning without extensive manual annotation.

Findings

01

The DreamFrame dataset contains ~1k stylized videos and 100k QA pairs.

02

LVLMs fine-tuned on DreamFrame outperform previous models on benchmarks.

03

DreamFrame-7B surpasses similar-sized LVLMs in various evaluations.

Abstract

Recent large vision-language models (LVLMs) for video understanding are primarily fine-tuned with various videos scraped from online platforms. Existing datasets, such as ActivityNet, require considerable human labor for structuring and annotation before they can be effectively used for tuning LVLMs. While current LVLMs are primarily trained on existing datasets in broad, general-purpose settings, adapting them to specific downstream scenarios remains challenging, as collecting and annotating task-specific videos is highly labor-intensive and time-consuming. To address this issue, we propose a three-stage framework named DreamFrame for automatically generating style-consistent keyframes and corresponding question-answer (QA) pairs to support LVLM instruction tuning. DreamFrame generates datasets in a movie-like manner. First, we utilize an LLM to generate structured movie plots including movie…
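As a rough illustration of the kind of output the abstract describes (keyframes paired with QA annotations for instruction tuning), the following is a minimal sketch, not the authors' data format. The class names, field names, file paths, and the `to_instruction_records` helper are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class QAPair:
    """One question-answer pair attached to a generated video (hypothetical schema)."""
    question: str
    answer: str


@dataclass
class GeneratedVideo:
    """A generated sample: an ordered list of keyframe paths plus its QA pairs.

    The field names are assumptions for illustration, not taken from the paper.
    """
    video_id: str
    keyframe_paths: List[str]
    qa_pairs: List[QAPair] = field(default_factory=list)


def to_instruction_records(video: GeneratedVideo) -> List[Dict[str, object]]:
    """Flatten one video into one instruction-tuning record per QA pair."""
    return [
        {
            "video_id": video.video_id,
            "frames": list(video.keyframe_paths),  # copy so records do not share a list
            "instruction": qa.question,
            "response": qa.answer,
        }
        for qa in video.qa_pairs
    ]


# Made-up example data, used only to show the shape of the records.
sample = GeneratedVideo(
    video_id="demo_0001",
    keyframe_paths=["demo_0001/frame_00.png", "demo_0001/frame_01.png"],
    qa_pairs=[
        QAPair("Where does the scene take place?", "In a small kitchen."),
        QAPair("What happens after the character enters?", "They start cooking."),
    ],
)
records = to_instruction_records(sample)
```

Each record repeats the full keyframe list so that a training loader can treat every QA pair as an independent example; a real pipeline might store frames once and reference them by ID instead.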


Code & Models

Models

Hugging Face model: Alrightalright/DreamFrame-Related


Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition · Multimodal Machine Learning Applications

Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Multi-Head Attention · Layer Normalization · Dropout · Softmax · Dense Connections · Label Smoothing · Adam