Exploring Vision-Language Models for Open-Vocabulary Zero-Shot Action Segmentation
Asim Unmesh, Kaki Ramesh, Mayank Patel, Rahul Jain, Karthik Ramani

TL;DR
This paper introduces a training-free method leveraging vision-language models for open-vocabulary zero-shot temporal action segmentation, enabling video segmentation into actions without extensive labeled datasets.
Contribution
It proposes a novel segmentation-by-classification pipeline and provides the first broad analysis of VLMs' suitability for open-vocabulary action segmentation.
Findings
Achieves strong zero-shot segmentation results on benchmarks
Systematic evaluation across 14 diverse VLMs
Demonstrates potential of VLMs for structured temporal understanding
Abstract
Temporal Action Segmentation (TAS) requires dividing videos into action segments, yet the vast space of activities and alternative breakdowns makes collecting comprehensive datasets infeasible. Existing methods remain limited to closed vocabularies and fixed label sets. In this work, we explore the largely unexplored problem of Open-Vocabulary Zero-Shot Temporal Action Segmentation (OVTAS) by leveraging the strong zero-shot capabilities of Vision-Language Models (VLMs). We introduce a training-free pipeline that follows a segmentation-by-classification design: Frame-Action Embedding Similarity (FAES) matches video frames to candidate action labels, and Similarity-Matrix Temporal Segmentation (SMTS) enforces temporal consistency. Beyond proposing OVTAS, we present a systematic study across 14 diverse VLMs, providing the first broad analysis of their suitability for open-vocabulary action…
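The segmentation-by-classification idea in the abstract can be illustrated with a minimal sketch. It assumes frame and label embeddings have already been produced by some VLM's image and text encoders (the abstract does not say which). The function names `frame_action_similarity` and `smooth_labels` and the `switch_penalty` parameter are hypothetical. The temporal step here is a generic Viterbi-style decoding with a label-switch penalty, used only as a stand-in for enforcing temporal consistency; it is not the paper's SMTS algorithm, whose details are not given in this summary.

```python
import numpy as np


def frame_action_similarity(frame_emb, text_emb):
    """Cosine similarity of every frame to every candidate label; shape (T, K)."""
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return f @ t.T


def smooth_labels(sim, switch_penalty=0.5):
    """Pick one label per frame, maximizing summed similarity minus a
    fixed penalty for every label change (assumed stand-in, not SMTS)."""
    T, K = sim.shape
    score = sim[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        best_prev = int(np.argmax(score))
        # Option 1: keep the same label (no cost).
        # Option 2: arrive from the best previous label and pay the penalty.
        switch = score[best_prev] - switch_penalty
        take_switch = switch > score
        back[t] = np.where(take_switch, best_prev, np.arange(K))
        score = np.where(take_switch, switch, score) + sim[t]
    # Backtrack from the best final label to recover the full sequence.
    labels = np.empty(T, dtype=int)
    labels[-1] = int(np.argmax(score))
    for t in range(T - 1, 0, -1):
        labels[t - 1] = back[t, labels[t]]
    return labels
```

Taking `sim.argmax(axis=1)` alone gives the per-frame classification; passing `sim` through `smooth_labels` trades a little per-frame similarity for fewer spurious label changes, with `switch_penalty` controlling how strongly short segments are suppressed.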
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Advanced Neural Network Applications
