Exploring Vision-Language Models for Open-Vocabulary Zero-Shot Action Segmentation

Asim Unmesh; Kaki Ramesh; Mayank Patel; Rahul Jain; Karthik Ramani

arXiv:2602.21406 · cs.CV · February 26, 2026

TL;DR

This paper introduces a training-free method leveraging vision-language models for open-vocabulary zero-shot temporal action segmentation, enabling video segmentation into actions without extensive labeled datasets.

Contribution

It proposes a novel segmentation-by-classification pipeline and provides the first broad analysis of VLMs' suitability for open-vocabulary action segmentation.

Findings

01. Achieves strong zero-shot segmentation results on benchmarks
02. Systematic evaluation across 14 diverse VLMs
03. Demonstrates potential of VLMs for structured temporal understanding

Abstract

Temporal Action Segmentation (TAS) requires dividing videos into action segments, yet the vast space of activities and alternative breakdowns makes collecting comprehensive datasets infeasible. Existing methods remain limited to closed vocabularies and fixed label sets. In this work, we explore the largely unexplored problem of Open-Vocabulary Zero-Shot Temporal Action Segmentation (OVTAS) by leveraging the strong zero-shot capabilities of Vision-Language Models (VLMs). We introduce a training-free pipeline that follows a segmentation-by-classification design: Frame-Action Embedding Similarity (FAES) matches video frames to candidate action labels, and Similarity-Matrix Temporal Segmentation (SMTS) enforces temporal consistency. Beyond proposing OVTAS, we present a systematic study across 14 diverse VLMs, providing the first broad analysis of their suitability for open-vocabulary action…
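
The segmentation-by-classification idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how FAES or SMTS are computed, so the sketch assumes that frame and label embeddings have already been produced by some VLM, uses cosine similarity for the frame-to-label matching, and substitutes a simple Viterbi-style decoding with a label-switch penalty for the temporal-consistency step. The function names, the `switch_penalty` parameter, and the toy 2-D embeddings are all hypothetical.

```python
import math

def cosine_similarity_matrix(frame_embs, label_embs):
    """Return a T x K matrix of cosine similarities (frames x candidate labels)."""
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]
    frames = [normalize(f) for f in frame_embs]
    labels = [normalize(l) for l in label_embs]
    return [[sum(a * b for a, b in zip(f, l)) for l in labels] for f in frames]

def smooth_labels(sim, switch_penalty=0.5):
    """Assumed stand-in for temporal smoothing: pick the label path that
    maximises summed similarity minus a fixed penalty per label change."""
    num_labels = len(sim[0])
    score = list(sim[0])
    back = []
    for row in sim[1:]:
        best_prev = max(range(num_labels), key=lambda k: score[k])
        new_score, pointers = [], []
        for k in range(num_labels):
            stay = score[k]
            switch = score[best_prev] - switch_penalty
            if stay >= switch:
                new_score.append(stay + row[k])
                pointers.append(k)
            else:
                new_score.append(switch + row[k])
                pointers.append(best_prev)
        score = new_score
        back.append(pointers)
    # Backtrack from the best final label to recover the full path.
    k = max(range(num_labels), key=lambda j: score[j])
    path = [k]
    for pointers in reversed(back):
        k = pointers[k]
        path.append(k)
    return path[::-1]

def to_segments(path, label_names):
    """Collapse a per-frame label path into (label, start_frame, end_frame) runs."""
    segments, start = [], 0
    for t in range(1, len(path) + 1):
        if t == len(path) or path[t] != path[start]:
            segments.append((label_names[path[start]], start, t - 1))
            start = t
    return segments

# Toy data (hypothetical): two candidate actions and seven frames, one of them noisy.
label_names = ["pour", "stir"]
label_embs = [[1.0, 0.0], [0.0, 1.0]]
frame_embs = [[1.0, 0.1], [1.0, 0.2], [0.4, 0.6], [1.0, 0.1],
              [0.1, 1.0], [0.2, 1.0], [0.0, 1.0]]

sim = cosine_similarity_matrix(frame_embs, label_embs)
segments = to_segments(smooth_labels(sim, switch_penalty=0.5), label_names)
print(segments)
```

In this toy run, a per-frame argmax would label the third frame "stir" and split the first action into fragments, while the penalised decoding absorbs that frame into the surrounding "pour" segment. Any real choice of penalty would need tuning, and the paper's SMTS may work quite differently.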

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Advanced Neural Network Applications