TWLV-I: Analysis and Insights from Holistic Evaluation on Video Foundation Models

Hyeongmin Lee; Jin-Young Kim; Kyungjune Baek; Jihwan Kim; Hyojun Go; Seongsu Ha; Seokjin Han; Jiho Jang; Raehyuk Jung; Daewoo Kim; GeunOh Kim; JongMok Kim; Jongseok Kim; Junwan Kim; Soonwoo Kwon; Jangwon Lee; Seungjoon Park; Minjoon Seo; Jay Suh; Jaehyuk Yi; Aiden Lee

arXiv:2408.11318·cs.CV·August 26, 2024

TL;DR

This paper introduces TWLV-I, a new video foundation model that offers robust visual representations for both motion and appearance, along with a comprehensive evaluation framework to fairly compare video models.

Contribution

We present TWLV-I, a novel video foundation model with improved capabilities, and a carefully designed evaluation framework for fair and robust comparison of video models.

Findings

01. TWLV-I outperforms existing models on action recognition benchmarks.

02. Pretrained on publicly accessible datasets, TWLV-I shows significant accuracy improvements.

03. Evaluation code and embeddings are publicly available for reproducibility.

Abstract

In this work, we discuss evaluating video foundation models in a fair and robust manner. Unlike language or image foundation models, many video foundation models are evaluated with differing parameters (such as sampling rate, number of frames, pretraining steps, etc.), making fair and robust comparisons challenging. Therefore, we present a carefully designed evaluation framework for measuring two core capabilities of video comprehension: appearance and motion understanding. Our findings reveal that existing video foundation models, whether text-supervised like UMT or InternVideo2, or self-supervised like V-JEPA, exhibit limitations in at least one of these capabilities. As an alternative, we introduce TWLV-I, a new video foundation model that constructs robust visual representations for both motion- and appearance-based videos. Based on the average top-1 accuracy of linear probing on…
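The metric the abstract names, average top-1 accuracy of linear probing, can be illustrated with a small sketch. This is not the paper's evaluation code: the function names, the synthetic "embeddings", and the hyperparameters below are all assumptions chosen for illustration. The idea is that the backbone stays frozen, only a linear classifier is trained on its precomputed features, and the resulting top-1 accuracies are averaged over benchmarks.

```python
import math
import random


def _logits(weights, biases, x):
    # One score per class: w_c . x + b_c
    return [sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
            for w, b in zip(weights, biases)]


def train_linear_probe(features, labels, num_classes, epochs=50, lr=0.5):
    """Fit a softmax linear classifier on frozen features.

    Only the probe's weights are updated; the features are treated as
    fixed outputs of a backbone that is never fine-tuned.
    """
    dim, n = len(features[0]), len(features)
    weights = [[0.0] * dim for _ in range(num_classes)]
    biases = [0.0] * num_classes
    for _ in range(epochs):
        grad_w = [[0.0] * dim for _ in range(num_classes)]
        grad_b = [0.0] * num_classes
        for x, y in zip(features, labels):
            scores = _logits(weights, biases, x)
            top = max(scores)  # subtract the max for numerical stability
            exps = [math.exp(s - top) for s in scores]
            total = sum(exps)
            for c in range(num_classes):
                # Cross-entropy gradient: softmax probability minus one-hot target
                err = exps[c] / total - (1.0 if c == y else 0.0)
                grad_b[c] += err
                for j in range(dim):
                    grad_w[c][j] += err * x[j]
        for c in range(num_classes):
            biases[c] -= lr * grad_b[c] / n
            for j in range(dim):
                weights[c][j] -= lr * grad_w[c][j] / n
    return weights, biases


def top1_accuracy(weights, biases, features, labels):
    """Fraction of samples whose highest-scoring class matches the label."""
    correct = 0
    for x, y in zip(features, labels):
        scores = _logits(weights, biases, x)
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == y)
    return correct / len(labels)


def average_top1(accuracies):
    """Unweighted mean of per-benchmark top-1 accuracies."""
    return sum(accuracies) / len(accuracies)


def make_toy_embeddings(seed, num_classes=3, dim=4, per_class=20):
    """Synthetic stand-in for precomputed video embeddings (hypothetical data)."""
    rng = random.Random(seed)
    features, labels = [], []
    for c in range(num_classes):
        for _ in range(per_class):
            x = [rng.gauss(0.0, 0.3) for _ in range(dim)]
            x[c] += 3.0  # class-specific offset keeps the toy classes separable
            features.append(x)
            labels.append(c)
    return features, labels


if __name__ == "__main__":
    per_benchmark = []
    for train_seed, test_seed in [(0, 1), (2, 3)]:  # two toy "benchmarks"
        train_x, train_y = make_toy_embeddings(train_seed)
        test_x, test_y = make_toy_embeddings(test_seed)
        w, b = train_linear_probe(train_x, train_y, num_classes=3)
        per_benchmark.append(top1_accuracy(w, b, test_x, test_y))
    print("average top-1:", average_top1(per_benchmark))
```

In practice the features would come from a frozen video model run with fixed settings (the abstract mentions sampling rate and number of frames as settings that often differ between evaluations), so that differences in accuracy reflect the representations rather than the evaluation setup.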

Code & Models

Repositories

twelvelabs-io/video-embeddings-evaluation-framework
PyTorch (official)

Taxonomy

Topics: Advanced Vision and Imaging