Valley: Video Assistant with Large Language model Enhanced abilitY

Ruipu Luo; Ziwang Zhao; Min Yang; Zheming Yang; Minghui Qiu; Tao Wang; Zhongyu Wei; Yanhao Wang; Cen Chen

arXiv:2306.07207·cs.CV·March 18, 2025·30 cites

TL;DR

Valley is a multi-modal foundation model that enhances video comprehension and instruction-following by integrating large language models with visual understanding, supported by new datasets and a two-phase training approach.

Contribution

The paper introduces Valley, a novel multi-modal model with datasets and training methods that improve joint video and language understanding capabilities.

Findings

01 Effective in diverse video-text tasks

02 Improves instruction-following in videos

03 Demonstrates strong performance in complex scenarios

Abstract

Large Language Models (LLMs), with remarkable conversational capability, have emerged as AI assistants that can handle both visual and textual modalities. However, their effectiveness in joint video and language understanding has not been extensively explored. In the paper, we introduce Valley, a multi-modal foundation model that is designed to enable enhanced video comprehension and instruction-following capabilities. To this end, we construct two datasets, namely Valley-702k and Valley-instruct-73k, to cover a diverse range of video-text alignment and video-based instruction tasks, such as multi-shot captions, long video descriptions, action recognition, causal inference, etc. Then, we adopt ViT-L/14 as the vision encoder and explore three different temporal modeling modules to learn multifaceted features for enhanced video understanding. In addition, we implement a two-phase training…
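The abstract mentions temporal modeling modules on top of per-frame ViT features but does not describe them here. As a rough illustration only, the sketch below shows one simple form such a module could take: averaging each patch token across frames so a video is reduced to a single set of tokens. The function name `temporal_mean_pool`, the toy shapes, and the choice of mean pooling are all assumptions for illustration, not the paper's confirmed design.

```python
from typing import List

Vector = List[float]
FrameTokens = List[Vector]  # one frame: [num_patches][dim]


def temporal_mean_pool(video: List[FrameTokens]) -> FrameTokens:
    """Average each patch token over the time (frame) axis.

    Hypothetical helper: a generic temporal pooling step, not the
    module used in the paper.
    """
    if not video:
        raise ValueError("video must contain at least one frame")
    num_frames = len(video)
    num_patches = len(video[0])
    dim = len(video[0][0])

    # Accumulate per-patch sums across all frames.
    sums = [[0.0] * dim for _ in range(num_patches)]
    for frame in video:
        if len(frame) != num_patches:
            raise ValueError("all frames must have the same number of patches")
        for p, token in enumerate(frame):
            for d, value in enumerate(token):
                sums[p][d] += value

    # Divide by the frame count to get the temporal mean.
    return [[s / num_frames for s in token] for token in sums]


# Toy input: 2 frames, 2 patches per frame, 2-dimensional features.
toy_video = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[3.0, 6.0], [5.0, 8.0]],
]
pooled_tokens = temporal_mean_pool(toy_video)
```

In a real pipeline the per-frame tokens would come from the vision encoder and the pooled tokens would be projected into the language model's embedding space. Mean pooling discards frame order, which is one reason a model might explore more than one temporal module.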

Peer Reviews

Decision·Submitted to ICLR 2024

Reviewer 01 · Rating 5 · marginally below the acceptance threshold · Confidence 3

Strengths

* Valley achieves state-of-the-art performance on multiple video QA benchmarks: MSVD-QA, MSRVTT-QA, and ActivityNet-QA.
* Valley collects a dataset of 100k videos with detailed captions and plans to release it, which will benefit the research community.

Weaknesses

* It is not clear what the technical novelty of the proposed method, Valley, is. Throughout the introduction, related work, and method sections there is no statement that explains how it differs technically from existing video-language models.
* No ablation is provided other than for the temporal modeling modules (v1, v2, v3), which also makes it difficult to judge which technical component mainly contributes to the performance.
* Among the temporal modeling modules, what is the unique adv

Reviewer 02 · Rating 6 · marginally above the acceptance threshold · Confidence 5

Strengths

- This is a simple and effective method. The paper is well-written and easy to follow.
- In my humble opinion, this work could be one of the first to explore instruction tuning in the video domain.
- Strong results on multiple benchmarks.
- The constructed dataset should be a valuable resource to the community.

Weaknesses

- While the data collection pipeline is well-formulated, this method requires very high-quality training data. Gathering high-quality video instruction data remains very challenging when aiming for large-scale training. This prevents very large-scale training from significantly boosting model quality, especially in the video domain, where data is often sparse and a very large amount of training data is required.
- The direct integration of vision transformers and LLMs ma

Reviewer 03 · Rating 3 · reject, not good enough · Confidence 4

Strengths

- This paper gathers a 73k video-based instruction dataset with the help of ChatGPT. This is somewhat larger than the instruction datasets used in previous methods (e.g., VideoChat uses 11K video instruction data). Judging from the experimental results provided by the authors, the quality of this dataset appears to be quite good.
- Experiments show that Valley excels in visual question answering and captioning, demonstrating optimal performance and strong zero-shot capability. It also generates con

Weaknesses

- In terms of instruction dataset construction, there seems to be a lack of innovation and comparative experiments. It appears that a higher-quality data source was simply used to collect data, and then common methods were employed to construct the instruction dataset. This was combined with instruction datasets from previous methods to obtain a larger instruction dataset. The decent performance achieved by this method in quantitative analysis may reflect the quality of the instruction dataset t

Code & Models

Repositories

rupertluo/valley
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Human Pose and Action Recognition

Methods: ALIGN