Valley: Video Assistant with Large Language model Enhanced abilitY
Ruipu Luo, Ziwang Zhao, Min Yang, Zheming Yang, Minghui Qiu, Tao Wang, Zhongyu Wei, Yanhao Wang, Cen Chen

TL;DR
Valley is a multi-modal foundation model that enhances video comprehension and instruction-following by integrating large language models with visual understanding, supported by new datasets and a two-phase training approach.
Contribution
The paper introduces Valley, a novel multi-modal model with datasets and training methods that improve joint video and language understanding capabilities.
Findings
Effective in diverse video-text tasks
Improves instruction-following in videos
Demonstrates strong performance in complex scenarios
Abstract
Large Language Models (LLMs), with remarkable conversational capability, have emerged as AI assistants that can handle both visual and textual modalities. However, their effectiveness in joint video and language understanding has not been extensively explored. In this paper, we introduce Valley, a multi-modal foundation model designed to enable enhanced video comprehension and instruction-following capabilities. To this end, we construct two datasets, namely Valley-702k and Valley-instruct-73k, to cover a diverse range of video-text alignment and video-based instruction tasks, such as multi-shot captions, long video descriptions, action recognition, causal inference, etc. Then, we adopt ViT-L/14 as the vision encoder and explore three different temporal modeling modules to learn multifaceted features for enhanced video understanding. In addition, we implement a two-phase training…
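The truncated abstract does not say what the three temporal modeling modules are. As a purely illustrative sketch, and not the authors' method, one simple way to turn per-frame vision-encoder tokens into a single video-level token sequence is to average the patch tokens over time and keep each frame's [CLS] token. The function name, the token layout (token 0 of each frame is [CLS]), and the toy dimensions below are all assumptions made for the example.

```python
from typing import List

Vector = List[float]


def mean_pool_frames(frame_tokens: List[List[Vector]]) -> List[Vector]:
    """Collapse per-frame tokens into one video-level token sequence.

    Hypothetical layout: frame_tokens[t][i] is token i of frame t, and
    token 0 of every frame is assumed to be that frame's [CLS] token.
    Returns the time-averaged patch tokens followed by one [CLS] per frame.
    """
    if not frame_tokens:
        raise ValueError("need at least one frame")
    num_frames = len(frame_tokens)
    num_tokens = len(frame_tokens[0])
    if any(len(frame) != num_tokens for frame in frame_tokens):
        raise ValueError("every frame must have the same number of tokens")
    dim = len(frame_tokens[0][0])

    # Average each patch position across all frames (temporal mean pooling).
    pooled = [
        [sum(frame_tokens[t][i][d] for t in range(num_frames)) / num_frames
         for d in range(dim)]
        for i in range(1, num_tokens)
    ]
    # Keep the per-frame [CLS] tokens so some temporal order survives.
    cls_tokens = [list(frame[0]) for frame in frame_tokens]
    return pooled + cls_tokens


# Toy input: 2 frames, each with 1 [CLS] token + 2 patch tokens of width 2.
frames = [
    [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0]],
    [[0.0, 1.0], [2.0, 0.0], [0.0, 0.0]],
]
video_tokens = mean_pool_frames(frames)
```

With T frames and P patch tokens per frame, this yields P + T tokens instead of T × (1 + P), which keeps the sequence passed to the language model short. Whether Valley uses this exact pooling is not stated in the text above.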
Peer Reviews
Decision: Submitted to ICLR 2024
* Valley achieves state-of-the-art performance on multiple video QA benchmarks: MSVD-QA, MSRVTT-QA, and ActivityNet-QA.
* Valley collects a dataset of 100k videos with detailed captions and plans to release it, which will benefit the research community.

* It is not clear what the technical novelty of the proposed method, Valley, is. Throughout the introduction, related work, and method sections, there is no statement that explains how it differs technically from existing video-language models.
* No ablation is provided other than for the temporal modeling modules (v1, v2, v3), which makes it difficult to judge which technical component contributes most to the performance.
* Among the temporal modeling modules, what is the unique adv

- This is a simple and effective method. The paper is well-written and easy to follow.
- In my humble opinion, this work could be one of the first to explore instruction tuning in the video domain.
- Strong results on multiple benchmarks.
- The constructed dataset should be a valuable resource to the community.

- While the data collection pipeline is well-formulated, this method requires very high-quality training data. Gathering high-quality video instruction data remains very challenging when aiming for large-scale training. This prevents very large-scale training from significantly boosting model quality, especially in the video domain, where data is often sparse and a very large amount of training data is required.
- The direct integration of vision transformers and LLMs ma

- This paper gathers a 73k video-based instruction dataset with the help of ChatGPT. This is somewhat larger than the instruction datasets used in previous methods (e.g., VideoChat uses 11K video instruction data). Judging from the experimental results provided by the authors, the quality of this dataset appears to be quite good.
- Experiments show that Valley excels in visual question answering and captioning, demonstrating optimal performance and strong zero-shot capability. It also generates con

- In terms of instruction dataset construction, there seems to be a lack of innovation and comparative experiments. It appears that a higher-quality data source was simply used to collect data, and then common methods were employed to construct the instruction dataset. This was combined with instruction datasets from previous methods to obtain a larger instruction dataset. The decent performance achieved by this method in quantitative analysis may reflect the quality of the instruction dataset t
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Human Pose and Action Recognition
Methods: ALIGN
