FreeVA: Offline MLLM as Training-Free Video Assistant

Wenhao Wu

arXiv:2405.07798 · cs.CV · June 11, 2024 · 1 citation

TL;DR

FreeVA demonstrates that an offline, image-based multimodal large language model can effectively perform zero-shot video question-answering without additional training, challenging the necessity of video instruction tuning.

Contribution

This work introduces FreeVA, a training-free approach that extends image-based MLLMs to videos, providing a strong baseline and revealing surprising insights about training and evaluation practices.

Findings

01. FreeVA outperforms some state-of-the-art video MLLMs in zero-shot QA.

02. Video instruction tuning with VideoInstruct-100K does not improve performance.

03. Evaluation metrics are affected by GPT API version changes.

Abstract

This paper undertakes an empirical study to revisit the latest advancements in Multimodal Large Language Models (MLLMs) for video assistants. The study, named FreeVA, aims to extend existing image-based MLLMs to the video domain in a training-free manner. It provides an essential, must-know baseline and reveals several surprising findings: 1) FreeVA, leveraging only an offline image-based MLLM without additional training, excels in zero-shot video question-answering (e.g., MSVD-QA, ActivityNet-QA, and MSRVTT-QA), even surpassing state-of-the-art methods that involve video instruction tuning. 2) While mainstream video-based MLLMs typically initialize with an image-based MLLM (e.g., LLaVA) and then fine-tune using video instruction tuning, the study indicates that using the widely adopted VideoInstruct-100K for video instruction tuning does not actually lead to better performance…
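The abstract does not say how frames are passed to the image-based MLLM, so the sketch below only illustrates one common training-free recipe: sample a few frames uniformly, encode each with the frozen image encoder, and average the per-frame token features before handing them to the language model. The function names `sample_frame_indices` and `mean_pool_frames`, and the choice of mean pooling, are assumptions made for illustration and are not taken from the FreeVA paper or repository.

```python
from typing import List


def sample_frame_indices(num_frames: int, num_samples: int) -> List[int]:
    """Pick frame indices spread evenly over a clip (midpoint of each segment).

    Hypothetical helper: the sampling strategy is an assumption.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    num_samples = min(num_samples, num_frames)
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def mean_pool_frames(frame_tokens: List[List[List[float]]]) -> List[List[float]]:
    """Average token features across frames: [T][N][D] -> [N][D].

    Hypothetical helper: mean pooling is one possible parameter-free
    aggregation, assumed here for illustration.
    """
    if not frame_tokens:
        raise ValueError("need at least one frame")
    t = len(frame_tokens)
    n = len(frame_tokens[0])
    d = len(frame_tokens[0][0])
    return [
        [sum(frame_tokens[f][i][j] for f in range(t)) / t for j in range(d)]
        for i in range(n)
    ]


if __name__ == "__main__":
    # Four frames drawn from a 100-frame clip.
    print(sample_frame_indices(100, 4))
    # Two frames, one token each, two feature dimensions.
    print(mean_pool_frames([[[1.0, 2.0]], [[3.0, 4.0]]]))
```

Because nothing here has learnable parameters, a pipeline of this shape can reuse an image-based MLLM's weights unchanged, which is the sense of "training-free" in the abstract.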

Code & Models

Repositories

whwu95/freeva (PyTorch, official)

Taxonomy

Topics: Video Analysis and Summarization

Methods: Attention Is All You Need · Linear Layer · Discriminative Fine-Tuning · Multi-Head Attention · Layer Normalization · Dense Connections · Attention Dropout · Weight Decay · Cosine Annealing