HVM-1: Large-scale video models pretrained with nearly 5000 hours of human-like video data

A. Emin Orhan

arXiv:2407.18067·cs.CV·July 26, 2024

TL;DR

HVM-1 is a family of large-scale video models pretrained on nearly 5000 hours of human-like, mostly egocentric video. The models learn improved object representations and perform competitively on downstream tasks against a model pretrained on shorter, action-oriented video clips.

Contribution

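The paper presents HVM-1, large-scale video models pretrained with the spatiotemporal masked autoencoder (ST-MAE) algorithm on nearly 5000 hours of human-like video data. It compares them against a model pretrained on short, action-oriented clips (Kinetics-700), reporting competitive downstream performance and more robust object representations.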

Findings

01. HVM-1 models perform competitively with a Kinetics-700 pretrained model on downstream few-shot video and image recognition tasks.

02. HVM-1 learns more robust object representations.

03. Pretraining on human-like videos enhances temporal understanding.

Abstract

We introduce Human-like Video Models (HVM-1), large-scale video models pretrained with nearly 5000 hours of curated human-like video data (mostly egocentric, temporally extended, continuous video recordings), using the spatiotemporal masked autoencoder (ST-MAE) algorithm. We release two 633M parameter models trained at spatial resolutions of 224x224 and 448x448 pixels. We evaluate the performance of these models in downstream few-shot video and image recognition tasks and compare them against a model pretrained with 1330 hours of short action-oriented video clips from YouTube (Kinetics-700). HVM-1 models perform competitively against the Kinetics-700 pretrained model in downstream evaluations despite substantial qualitative differences between the spatiotemporal characteristics of the corresponding pretraining datasets. HVM-1 models also learn more accurate and more robust object…
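
To make the masked-autoencoder idea in the abstract concrete, here is a minimal sketch of how a video clip might be split into spacetime patches and randomly masked before reconstruction. The clip length (16 frames), patch size (2x16x16), and mask ratio (0.9) are illustrative assumptions and are not stated in the abstract. Only the 224x224 and 448x448 resolutions come from the source. The helper names `num_spacetime_patches` and `random_mask` are hypothetical and are not taken from the HVM-1 repository.

```python
import random


def num_spacetime_patches(frames, height, width, t_patch=2, s_patch=16):
    """Count non-overlapping spacetime patches in a clip.

    t_patch and s_patch are assumed values, not taken from the abstract.
    """
    return (frames // t_patch) * (height // s_patch) * (width // s_patch)


def random_mask(num_patches, mask_ratio, seed=0):
    """Randomly split patch indices into visible and masked sets.

    An encoder would see only the visible patches; a decoder would be
    trained to reconstruct the masked ones.
    """
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


# Patch counts at the two released resolutions, for an assumed 16-frame clip.
n224 = num_spacetime_patches(16, 224, 224)  # 8 * 14 * 14 = 1568
n448 = num_spacetime_patches(16, 448, 448)  # 8 * 28 * 28 = 6272

# Assumed mask ratio of 0.9: most patches are hidden from the encoder.
visible, masked = random_mask(n224, mask_ratio=0.9)
print(n224, n448, len(visible), len(masked))
```

Under these assumptions, doubling the spatial resolution quadruples the number of patches per clip, which is one reason a high mask ratio helps keep the encoder's input small.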

Code & Models

Repositories

eminorhan/hvm-1
PyTorch · Official

Models

🤗 eminorhan/hvm-1 (model)

Taxonomy

Topics: Advanced Vision and Imaging · Human Pose and Action Recognition · Computer Graphics and Visualization Techniques

Methods: Masked autoencoder