Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions

Hongwei Xue; Tiankai Hang; Yanhong Zeng; Yuchong Sun; Bei Liu; Huan Yang; Jianlong Fu; Baining Guo

arXiv:2111.10337 · cs.CV · July 11, 2022 · 6 cites

TL;DR

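This paper introduces HD-VILA, a high-resolution and diversified video-language pre-training model. It is trained on a large-scale dataset of 720p YouTube videos paired with transcriptions, and it improves results on video-language understanding tasks such as text-to-video retrieval as well as on text-to-visual editing and super-resolution.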

Contribution

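The paper contributes a high-resolution, diversified video-language dataset (371.5k hours of 720p video across 15 popular YouTube categories) and a pre-training model that pairs a hybrid Transformer for spatiotemporal features with a multimodal Transformer for cross-modal interaction, achieving state-of-the-art results.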

Findings

01  40.4% relative increase in zero-shot MSR-VTT text-to-video retrieval R@1

02  55.4% relative increase in text-to-video retrieval R@1 on the high-resolution LSMDC dataset

03  Effective in text-to-visual editing and super-resolution tasks
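
The retrieval findings are stated as relative gains in R@1, so it may help to see how both quantities are usually computed. The sketch below is illustrative only: the function names, the convention that text i matches video i, and the toy similarity scores are all assumptions made for this example and are not taken from the paper or its code.

```python
def recall_at_k(similarity, k=1):
    """Fraction of text queries whose matching video ranks in the top k.

    similarity[i][j] is the score between text i and video j. The matching
    video for text i is assumed to sit at index i (a toy convention).
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank video indices by descending score for this text query.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


def relative_increase(new, baseline):
    """Relative gain of `new` over `baseline`, as a percentage."""
    return (new - baseline) / baseline * 100.0


# Toy 3x3 similarity matrix (made-up numbers, not results from the paper).
toy_scores = [
    [0.9, 0.1, 0.3],  # text 0: matching video 0 ranks first
    [0.2, 0.4, 0.8],  # text 1: matching video 1 ranks second
    [0.1, 0.2, 0.7],  # text 2: matching video 2 ranks first
]

r1 = recall_at_k(toy_scores, k=1)  # 2 of 3 queries hit at rank 1
r2 = recall_at_k(toy_scores, k=2)  # all 3 queries hit within rank 2
```

Under this definition, a relative increase of 40.4% means the new R@1 is 1.404 times the baseline R@1; it is not a gain of 40.4 percentage points.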

Abstract

We study joint video and language (VL) pre-training to enable cross-modality learning and benefit plentiful downstream VL tasks. Existing works either extract low-quality video features or learn limited text embedding, while neglecting that high-resolution videos and diversified semantics can significantly improve cross-modality learning. In this paper, we propose a novel High-resolution and Diversified VIdeo-LAnguage pre-training model (HD-VILA) for many visual tasks. In particular, we collect a large dataset with two distinct properties: 1) the first high-resolution dataset including 371.5k hours of 720p videos, and 2) the most diversified dataset covering 15 popular YouTube categories. To enable VL pre-training, we jointly optimize the HD-VILA model by a hybrid Transformer that learns rich spatiotemporal features, and a multimodal Transformer that enforces interactions of the learned…

Code & Models

Repositories

microsoft/xpretrain
PyTorch · Official

Datasets

yaolily/TimeChat-Online-139K
dataset · 3.1k dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Adam · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Dense Connections · Softmax