Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions
Hongwei Xue, Tiankai Hang, Yanhong Zeng, Yuchong Sun, Bei Liu, Huan Yang, Jianlong Fu, Baining Guo

TL;DR
This paper introduces HD-VILA, a high-resolution, diversified video-language pre-training model that leverages a large-scale dataset to significantly improve performance across multiple visual and language understanding tasks.
Contribution
The paper presents a novel high-resolution, diversified dataset and a hybrid Transformer model for joint video and language pre-training, achieving state-of-the-art results.
Findings
40.4% relative increase in zero-shot MSR-VTT text-to-video retrieval R@1
55.4% relative increase in retrieval performance on the high-resolution LSMDC dataset
Effective in text-to-visual editing and super-resolution tasks
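
For readers unfamiliar with the metric behind the first two findings, the following is a minimal sketch of how text-to-video R@1 and a relative increase are commonly computed. It is not the authors' evaluation code. The function names, the toy similarity matrix, and the example scores are all hypothetical, and the sketch assumes that the ground-truth video for query i sits at index i.

```python
def recall_at_k(similarity, k=1):
    """Fraction of text queries whose ground-truth video ranks in the top k.

    similarity[i][j] is the score between text query i and video j.
    Assumption for this sketch: the correct video for query i is video i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank video indices by descending similarity score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


def relative_increase(new_score, baseline_score):
    """Relative gain of new_score over baseline_score, as a fraction."""
    return (new_score - baseline_score) / baseline_score


# Toy 3x3 similarity matrix (hypothetical values, not from the paper).
sim = [
    [0.9, 0.1, 0.3],  # query 0: video 0 ranks first
    [0.2, 0.4, 0.8],  # query 1: video 2 ranks first, video 1 second
    [0.5, 0.1, 0.7],  # query 2: video 2 ranks first
]

r_at_1 = recall_at_k(sim, k=1)
r_at_2 = recall_at_k(sim, k=2)

# Hypothetical scores: moving from 10.0 to 15.0 is a 50% relative increase.
gain = relative_increase(15.0, 10.0)
```

Note that a relative increase is measured against the baseline score, so a 40.4% relative increase in R@1 is not the same as a gain of 40.4 percentage points.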
Abstract
We study joint video and language (VL) pre-training to enable cross-modality learning and benefit plentiful downstream VL tasks. Existing works either extract low-quality video features or learn limited text embedding, while neglecting that high-resolution videos and diversified semantics can significantly improve cross-modality learning. In this paper, we propose a novel High-resolution and Diversified VIdeo-LAnguage pre-training model (HD-VILA) for many visual tasks. In particular, we collect a large dataset with two distinct properties: 1) the first high-resolution dataset including 371.5k hours of 720p videos, and 2) the most diversified dataset covering 15 popular YouTube categories. To enable VL pre-training, we jointly optimize the HD-VILA model by a hybrid Transformer that learns rich spatiotemporal features, and a multimodal Transformer that enforces interactions of the learned…
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Adam · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Dense Connections · Softmax
