VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition

Tanush Yadav; Mohammadreza Salehi; Jae Sung Park; Vivek Ramanujan; Hannaneh Hajishirzi; Yejin Choi; Ali Farhadi; Rohun Tripathi; Ranjay Krishna

arXiv:2605.02834·cs.CV·May 6, 2026

TL;DR

VideoNet introduces a large-scale domain-specific action recognition dataset and benchmark, revealing current vision-language models' limitations and demonstrating the benefits of fine-tuning on specialized data.

Contribution

The paper presents the first large-scale training dataset for domain-specific actions, evaluates VLMs on the accompanying benchmark, and shows that fine-tuning on specialized data improves action recognition.

Findings

01. VLMs perform poorly on the VideoNet benchmark, especially in open and few-shot settings.

02. Fine-tuning on the new dataset significantly improves model accuracy, surpassing open-weight models.

03. Humans outperform models in few-shot action recognition by a notable margin.

Abstract

Videos are unique in their ability to capture actions which transcend multiple frames. Accordingly, for many years action recognition was the quintessential task for video understanding. Unfortunately, due to a lack of sufficiently diverse and challenging data, modern vision-language models (VLMs) are no longer evaluated on their action recognition capabilities. To revitalize action recognition in the era of VLMs, we advocate for a returned focus on domain-specific actions. To this end, we introduce VideoNet, a domain-specific action recognition benchmark covering 1,000 distinct actions from 37 domains. We begin with a multiple-choice evaluation setting, where the difference between closed and open models is stark: Gemini 3.1 Pro attains 69.9% accuracy while Qwen3-VL-8B gets a mere 45.0%. To understand why VLMs struggle on VideoNet, we relax the questions into a binary setting, where…
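As a rough illustration of the multiple-choice setting the abstract describes, accuracy can be scored as the fraction of questions where the model's selected option matches the ground-truth action. The sketch below is a minimal, hypothetical example: the `MCQuestion` class, its field names, and the placeholder records are assumptions made for illustration, not the VideoNet schema or the authors' evaluation code.

```python
from dataclasses import dataclass


@dataclass
class MCQuestion:
    """One multiple-choice item (hypothetical layout, not the VideoNet schema)."""
    options: list[str]   # candidate action labels shown to the model
    answer: str          # ground-truth action label
    prediction: str      # label the model selected


def multiple_choice_accuracy(questions: list[MCQuestion]) -> float:
    """Fraction of questions where the prediction equals the answer."""
    if not questions:
        raise ValueError("need at least one question")
    correct = sum(q.prediction == q.answer for q in questions)
    return correct / len(questions)


def chance_accuracy(questions: list[MCQuestion]) -> float:
    """Expected accuracy of uniform random guessing over each question's options."""
    if not questions:
        raise ValueError("need at least one question")
    return sum(1 / len(q.options) for q in questions) / len(questions)


# Placeholder records for illustration only; labels are made up.
options = ["action_a", "action_b", "action_c", "action_d"]
sample = [
    MCQuestion(options=options, answer="action_a", prediction="action_a"),
    MCQuestion(options=options, answer="action_c", prediction="action_b"),
]

print(f"accuracy: {multiple_choice_accuracy(sample):.3f}")
print(f"chance:   {chance_accuracy(sample):.3f}")
```

Reporting the chance baseline next to the measured accuracy helps put scores such as the 69.9% and 45.0% quoted above in context, since the baseline depends on how many options each question offers.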

Code & Models

Datasets

raivn/VideoNet
dataset · 13k downloads