TWLV-I: Analysis and Insights from Holistic Evaluation on Video Foundation Models
Hyeongmin Lee, Jin-Young Kim, Kyungjune Baek, Jihwan Kim, Hyojun Go, Seongsu Ha, Seokjin Han, Jiho Jang, Raehyuk Jung, Daewoo Kim, GeunOh Kim, JongMok Kim, Jongseok Kim, Junwan Kim, Soonwoo Kwon, Jangwon Lee, Seungjoon Park, Minjoon Seo, Jay Suh, Jaehyuk Yi, Aiden Lee

TL;DR
This paper introduces TWLV-I, a new video foundation model that offers robust visual representations for both motion and appearance, along with a comprehensive evaluation framework to fairly compare video models.
Contribution
We present TWLV-I, a novel video foundation model, together with a carefully designed evaluation framework for fair and robust comparison of video models on appearance and motion understanding.
Findings
TWLV-I outperforms existing models on action recognition benchmarks.
Pretrained only on publicly accessible datasets, TWLV-I improves average top-1 linear-probing accuracy over the existing models it is compared against.
Evaluation code and embeddings are publicly available for reproducibility.
Abstract
In this work, we discuss evaluating video foundation models in a fair and robust manner. Unlike language or image foundation models, many video foundation models are evaluated with differing parameters (such as sampling rate, number of frames, pretraining steps, etc.), making fair and robust comparisons challenging. Therefore, we present a carefully designed evaluation framework for measuring two core capabilities of video comprehension: appearance and motion understanding. Our findings reveal that existing video foundation models, whether text-supervised like UMT or InternVideo2, or self-supervised like V-JEPA, exhibit limitations in at least one of these capabilities. As an alternative, we introduce TWLV-I, a new video foundation model that constructs robust visual representations for both motion- and appearance-based videos. Based on the average top-1 accuracy of linear probing on…
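The abstract reports average top-1 accuracy of linear probing. As a rough illustration of that metric only, the sketch below fits a softmax linear classifier on frozen, precomputed embeddings and averages top-1 accuracy across benchmarks. The function names `linear_probe_top1` and `average_top1`, the optimizer, and all hyperparameters are assumptions made for this example; they are not taken from the paper or its released evaluation code.

```python
import numpy as np


def linear_probe_top1(train_x, train_y, test_x, test_y, num_classes,
                      epochs=200, lr=0.5):
    """Fit a softmax linear classifier on frozen embeddings; return top-1 accuracy.

    Hypothetical helper: the backbone is never updated, only the linear layer.
    """
    # Standardize features using training statistics only.
    mean = train_x.mean(axis=0)
    std = train_x.std(axis=0) + 1e-8
    train_x = (train_x - mean) / std
    test_x = (test_x - mean) / std

    n, d = train_x.shape
    w = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[train_y]

    # Plain full-batch gradient descent on mean cross-entropy (an assumption).
    for _ in range(epochs):
        logits = train_x @ w + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n  # gradient w.r.t. the logits
        w -= lr * (train_x.T @ grad)
        b -= lr * grad.sum(axis=0)

    pred = (test_x @ w + b).argmax(axis=1)
    return float((pred == test_y).mean())


def average_top1(per_benchmark_acc):
    """Unweighted mean of per-benchmark top-1 accuracies (dict: name -> accuracy)."""
    return sum(per_benchmark_acc.values()) / len(per_benchmark_acc)
```

Because only the linear layer is trained, differences in the resulting accuracy reflect the quality of the frozen embeddings rather than fine-tuning capacity, which is what makes this kind of probe useful for comparing models under matched settings.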
Taxonomy
Topics: Advanced Vision and Imaging
