VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition
Tanush Yadav, Mohammadreza Salehi, Jae Sung Park, Vivek Ramanujan, Hannaneh Hajishirzi, Yejin Choi, Ali Farhadi, Rohun Tripathi, Ranjay Krishna

TL;DR
VideoNet introduces a large-scale domain-specific action recognition dataset and benchmark, revealing current vision-language models' limitations and demonstrating the benefits of fine-tuning on specialized data.
Contribution
The paper presents the first large-scale training dataset for domain-specific actions and evaluates VLMs' performance, highlighting the importance of fine-tuning for improved action recognition.
Findings
VLMs perform poorly on the VideoNet benchmark, especially in open and few-shot settings.
Fine-tuning on the new dataset significantly improves accuracy; the fine-tuned models surpass existing open-weight models.
Humans outperform models in few-shot action recognition by a notable margin.
Abstract
Videos are unique in their ability to capture actions which transcend multiple frames. Accordingly, for many years action recognition was the quintessential task for video understanding. Unfortunately, due to a lack of sufficiently diverse and challenging data, modern vision-language models (VLMs) are no longer evaluated on their action recognition capabilities. To revitalize action recognition in the era of VLMs, we advocate for a returned focus on domain-specific actions. To this end, we introduce VideoNet, a domain-specific action recognition benchmark covering 1,000 distinct actions from 37 domains. We begin with a multiple-choice evaluation setting, where the difference between closed and open models is stark: Gemini 3.1 Pro attains 69.9% accuracy while Qwen3-VL-8B gets a mere 45.0%. To understand why VLMs struggle on VideoNet, we relax the questions into a binary setting, where…
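As a rough illustration of the two evaluation settings the abstract mentions, the sketch below scores multiple-choice predictions and expands one multiple-choice item into yes/no questions. The abstract is truncated before it defines the binary setting, so the one-question-per-option relaxation here is only an assumption. The names `MultipleChoiceItem`, `multiple_choice_accuracy`, and `to_binary_questions` are hypothetical and are not taken from the VideoNet code.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class MultipleChoiceItem:
    """One hypothetical benchmark item: a video, candidate actions, and the true action."""
    video_id: str
    options: tuple[str, ...]
    answer: str


def multiple_choice_accuracy(items: list[MultipleChoiceItem],
                             predictions: dict[str, str]) -> float:
    """Fraction of items whose predicted option matches the ground-truth answer.

    A missing prediction counts as incorrect.
    """
    if not items:
        raise ValueError("items must not be empty")
    correct = sum(1 for item in items if predictions.get(item.video_id) == item.answer)
    return correct / len(items)


def to_binary_questions(item: MultipleChoiceItem) -> list[tuple[str, str, bool]]:
    """Assumed relaxation: one yes/no question per candidate action.

    Each tuple is (video_id, candidate_action, expected_answer_is_yes).
    """
    return [(item.video_id, option, option == item.answer) for option in item.options]
```

Under this assumed relaxation, a model's binary answers can be scored per candidate action, which separates "cannot recognize the action at all" from "confuses it with a specific distractor".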
