Open Vocabulary Multi-Label Video Classification

Rohit Gupta; Mamshad Nayeem Rizve; Jayakrishnan Unnikrishnan; Ashish Tawari; Son Tran; Mubarak Shah; Benjamin Yao; Trishul Chilimbi

arXiv:2407.09073·cs.CV·October 14, 2025

TL;DR

This paper introduces a novel method for open vocabulary multi-label video classification by adapting pre-trained vision-language models with semantic guidance from large language models and temporal modeling, enabling recognition of multiple actions and entities in videos.

Contribution

It presents an end-to-end trainable architecture that prompts LLMs for soft attributes and integrates temporal modeling into CLIP for improved multi-label video understanding.

Findings

01. Effective recognition of multiple actions and objects in videos.

02. Superior performance on benchmark datasets.

03. Enhanced open vocabulary classification accuracy.

Abstract

Pre-trained vision-language models (VLMs) have enabled significant progress in open vocabulary computer vision tasks such as image classification, object detection, and image segmentation. Some recent works have focused on extending VLMs to open vocabulary single-label action classification in videos. However, previous methods fall short in holistic video understanding, which requires the ability to simultaneously recognize multiple actions and entities (e.g., objects) in the video in an open vocabulary setting. We formulate this problem as open vocabulary multi-label video classification and propose a method to adapt a pre-trained VLM such as CLIP to solve this task. We leverage large language models (LLMs) to provide semantic guidance to the VLM about class labels to improve its open vocabulary performance with two key contributions. First, we propose an end-to-end trainable architecture…
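To make the multi-label formulation concrete, the following is a minimal sketch of how a video could be scored against an open set of label names, each label decided independently. It is not the paper's architecture: the hand-written vectors stand in for CLIP frame and text embeddings, mean pooling stands in for the learned temporal modeling, and the function names, `temperature`, and `threshold` are all hypothetical choices made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def mean_pool(frame_embs):
    """Average per-frame embeddings into one video embedding.

    Assumption: a simple stand-in for a learned temporal module.
    """
    n = len(frame_embs)
    return [sum(column) / n for column in zip(*frame_embs)]


def multilabel_scores(frame_embs, label_embs, temperature=0.1):
    """Return an independent sigmoid score per label.

    Unlike a softmax over labels (single-label), each label is scored
    on its own, so several labels can be active for the same video.
    """
    video = l2_normalize(mean_pool(frame_embs))
    scores = {}
    for label, emb in label_embs.items():
        cosine = sum(a * b for a, b in zip(video, l2_normalize(emb)))
        scores[label] = 1.0 / (1.0 + math.exp(-cosine / temperature))
    return scores


def predict(scores, threshold=0.5):
    """Keep every label whose score exceeds the threshold."""
    return sorted(label for label, s in scores.items() if s > threshold)


# Toy 3-dimensional embeddings (hypothetical, not real CLIP outputs).
frames = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
labels = {
    "playing guitar": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "swimming": [-1.0, 0.0, 1.0],
}
print(predict(multilabel_scores(frames, labels)))
```

Because the label set is just a dictionary of text embeddings, new class names can be added at inference time without retraining, which is the sense of "open vocabulary" used in the abstract.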

Taxonomy

Topics: Natural Language Processing Techniques · Text and Document Classification Technologies

Methods: Contrastive Language-Image Pre-training