FreeVA: Offline MLLM as Training-Free Video Assistant
Wenhao Wu

TL;DR
FreeVA demonstrates that an offline, image-based multimodal large language model can effectively perform zero-shot video question-answering without additional training, challenging the necessity of video instruction tuning.
Contribution
This work introduces FreeVA, a training-free approach that extends image-based MLLMs to videos, providing a strong baseline and revealing surprising insights about training and evaluation practices.
Findings
FreeVA outperforms some state-of-the-art video MLLMs on zero-shot video QA.
Video instruction tuning with VideoInstruct-100K does not improve performance.
Evaluation metrics are affected by GPT API version changes.
Abstract
This paper undertakes an empirical study to revisit the latest advancements in Multimodal Large Language Models (MLLMs) as video assistants. This study, namely FreeVA, aims to extend existing image-based MLLMs to the video domain in a training-free manner. The study provides an essential, must-know baseline and reveals several surprising findings: 1) FreeVA, leveraging only an offline image-based MLLM without additional training, excels in zero-shot video question-answering (e.g., MSVD-QA, ActivityNet-QA, and MSRVTT-QA), even surpassing state-of-the-art methods that involve video instruction tuning. 2) While mainstream video-based MLLMs typically initialize with an image-based MLLM (e.g., LLaVA) and then fine-tune using video instruction tuning, the study indicates that utilizing the widely adopted VideoInstruct-100K for video instruction tuning doesn't actually lead to better performance…
Taxonomy
Topics: Video Analysis and Summarization
Methods: Attention Is All You Need · Linear Layer · Discriminative Fine-Tuning · Multi-Head Attention · Layer Normalization · Dense Connections · Attention Dropout · Weight Decay · Cosine Annealing
