Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision

Orr Zohar; Xiaohan Wang; Yonatan Bitton; Idan Szpektor; Serena Yeung-Levy

arXiv:2407.06189·cs.CV·July 9, 2024

TL;DR

Video-STaR introduces a self-training method that leverages any labeled video dataset to enhance large vision language models' understanding and task adaptation, significantly improving performance on video QA and downstream tasks.

Contribution

It presents the first video self-training approach enabling LVLMs to utilize diverse labeled video datasets for instruction tuning.

Findings

1. Improved TempCompass performance by 10%.
2. Increased Kinetics700-QA accuracy by 20%.
3. Enhanced action quality assessment on FineDiving by 15%.

Abstract

The performance of Large Vision Language Models (LVLMs) is dependent on the size and quality of their training datasets. Existing video instruction tuning datasets lack diversity as they are derived by prompting large language models with video captions to generate question-answer pairs, and are therefore mostly descriptive. Meanwhile, many labeled video datasets with diverse labels and supervision exist - however, we find that their integration into LVLMs is non-trivial. Herein, we present Video Self-Training with augmented Reasoning (Video-STaR), the first video self-training approach. Video-STaR allows the utilization of any labeled video dataset for video instruction tuning. In Video-STaR, an LVLM cycles between instruction generation and finetuning, which we show (I) improves general video understanding and (II) adapts LVLMs to novel downstream tasks with existing supervision.…
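
The cycle between instruction generation and finetuning described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `generate` and `finetune` callables, the `LabeledVideo` type, and the rule of keeping only answers that mention the video's existing label are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class LabeledVideo:
    """A video paired with whatever supervision its dataset provides."""
    video_id: str
    label: str  # e.g. an action class; hypothetical field for this sketch


# One generated instruction-tuning example: (video, question, answer).
Example = Tuple[LabeledVideo, str, str]


def self_training_cycle(
    model: Any,
    videos: List[LabeledVideo],
    question: str,
    generate: Callable[[Any, LabeledVideo, str], str],  # assumed interface
    finetune: Callable[[Any, List[Example]], Any],      # assumed interface
    cycles: int = 2,
) -> Any:
    """Alternate between generating answers and finetuning on the kept ones."""
    for _ in range(cycles):
        kept: List[Example] = []
        for video in videos:
            answer = generate(model, video, question)
            # Assumed filter: keep an answer only if it mentions the label
            # already attached to the video.
            if video.label.lower() in answer.lower():
                kept.append((video, question, answer))
        if kept:
            # Finetune on the kept examples, then repeat with the new model.
            model = finetune(model, kept)
    return model
```

Passing `generate` and `finetune` in as arguments keeps the sketch independent of any particular LVLM or training framework; a real pipeline would replace them with model inference and a training run.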

Code & Models

Repositories

orrzohar/Video-STaR
PyTorch · Official

Datasets

orrzohar/Video-STaR
dataset · 43 dl

Taxonomy

Topics: Collaborative Teaching and Inclusion