ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment

Amir Aghdam; Vincent Tao Hu; Björn Ommer

arXiv:2506.22967 · cs.CV · October 21, 2025

TL;DR

ActAlign is a zero-shot, training-free method that uses Dynamic Time Warping (DTW) to align LLM-generated sub-action sequences with video frames, enabling fine-grained video classification without video-text supervision and outperforming larger video-language models.

Contribution

ActAlign introduces a sequence-alignment approach that uses language priors, in the form of LLM-generated sub-action sequences, for zero-shot fine-grained video classification with no training.

Findings

01 Achieves 30.5% accuracy on the ActionAtlas benchmark.

02 Outperforms larger video-language models while using fewer parameters.

03 Demonstrates effectiveness across diverse sports actions.

Abstract

We address the task of zero-shot video classification for extremely fine-grained actions (e.g., Windmill Dunk in basketball), where no video examples or temporal annotations are available for unseen classes. While image-language models (e.g., CLIP, SigLIP) show strong open-set recognition, they lack the temporal modeling needed for video understanding. We propose ActAlign, a truly zero-shot, training-free method that formulates video classification as a sequence alignment problem, preserving the generalization strength of pretrained image-language models. For each class, a large language model (LLM) generates an ordered sequence of sub-actions, which we align with video frames using Dynamic Time Warping (DTW) in a shared embedding space. Without any video-text supervision or fine-tuning, ActAlign achieves 30.5% accuracy on ActionAtlas, the most diverse benchmark of fine-grained actions…
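The alignment step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`cosine_similarity_matrix`, `dtw_alignment_score`, `classify`), the path-length normalization, and the tie-breaking rule are all assumptions made for illustration, and the embeddings are synthetic unit vectors standing in for the outputs of an image-language model such as SigLIP. The sketch scores each class by the best monotonic alignment between its ordered sub-action embeddings and the frame embeddings, then picks the highest-scoring class.

```python
import numpy as np


def cosine_similarity_matrix(sub_actions, frames):
    """Cosine similarity between M sub-action and T frame embeddings, shape (M, T)."""
    a = sub_actions / np.linalg.norm(sub_actions, axis=1, keepdims=True)
    b = frames / np.linalg.norm(frames, axis=1, keepdims=True)
    return a @ b.T


def dtw_alignment_score(sim):
    """Best monotonic alignment score from (0, 0) to (M-1, T-1).

    Maximizes accumulated similarity with standard DTW moves and divides by
    the path length (an illustrative normalization, assumed here so that
    scripts of different lengths stay comparable).
    """
    num_sub, num_frames = sim.shape
    acc = np.full((num_sub, num_frames), -np.inf)   # accumulated similarity
    steps = np.zeros((num_sub, num_frames), dtype=int)  # path length so far
    acc[0, 0], steps[0, 0] = sim[0, 0], 1
    for i in range(num_sub):
        for j in range(num_frames):
            if i == 0 and j == 0:
                continue
            candidates = []
            if j > 0:            # same sub-action, next frame
                candidates.append((acc[i, j - 1], steps[i, j - 1]))
            if i > 0 and j > 0:  # next sub-action, next frame
                candidates.append((acc[i - 1, j - 1], steps[i - 1, j - 1]))
            if i > 0:            # next sub-action, same frame
                candidates.append((acc[i - 1, j], steps[i - 1, j]))
            # Prefer higher accumulated similarity; break ties with the shorter path.
            best, length = max(candidates, key=lambda c: (c[0], -c[1]))
            acc[i, j] = best + sim[i, j]
            steps[i, j] = length + 1
    return float(acc[-1, -1] / steps[-1, -1])


def classify(frames, class_scripts):
    """Return the best-scoring class name and the per-class scores."""
    scores = {
        name: dtw_alignment_score(cosine_similarity_matrix(subs, frames))
        for name, subs in class_scripts.items()
    }
    return max(scores, key=scores.get), scores


# Synthetic stand-ins for embeddings: three orthogonal "sub-action" directions.
e = np.eye(4)
frames = np.stack([e[0], e[0], e[1], e[1], e[2], e[2]])  # video shows 0 -> 1 -> 2
class_scripts = {
    "A": np.stack([e[0], e[1], e[2]]),  # same sub-actions, matching order
    "B": np.stack([e[2], e[1], e[0]]),  # same sub-actions, reversed order
}
label, scores = classify(frames, class_scripts)
print(label, scores)
```

Both hypothetical classes contain the same sub-actions, so a bag-of-frames comparison could not separate them; only the ordering constraint imposed by the alignment makes class "A" score higher in this toy setup.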
