TL;DR
ActAlign is a zero-shot, training-free method that aligns LLM-generated sub-action sequences with video frames using Dynamic Time Warping (DTW), enabling fine-grained video classification without supervision and outperforming larger video-language models.
Contribution
ActAlign casts zero-shot fine-grained video classification as a sequence alignment problem: ordered sub-action descriptions generated by an LLM act as language priors and are matched to video frames, with no training or fine-tuning.
Findings
Achieves 30.5% accuracy on the ActionAtlas benchmark without video-text supervision or fine-tuning.
Outperforms larger video-language models while using fewer parameters.
Demonstrates effectiveness across diverse sports actions.
Abstract
We address the task of zero-shot video classification for extremely fine-grained actions (e.g., Windmill Dunk in basketball), where no video examples or temporal annotations are available for unseen classes. While image-language models (e.g., CLIP, SigLIP) show strong open-set recognition, they lack the temporal modeling needed for video understanding. We propose ActAlign, a truly zero-shot, training-free method that formulates video classification as a sequence alignment problem, preserving the generalization strength of pretrained image-language models. For each class, a large language model (LLM) generates an ordered sequence of sub-actions, which we align with video frames using Dynamic Time Warping (DTW) in a shared embedding space. Without any video-text supervision or fine-tuning, ActAlign achieves 30.5% accuracy on ActionAtlas--the most diverse benchmark of fine-grained actions…
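The alignment step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`cosine_similarity_matrix`, `dtw_cost`, `classify`), the cost definition (1 minus cosine similarity), and the normalization by T + K are assumptions chosen for illustration, and the toy 2-D vectors stand in for real frame and sub-action embeddings that would come from an image-language model such as SigLIP.

```python
import numpy as np


def cosine_similarity_matrix(frames, steps):
    """Cosine similarity between every frame and every sub-action embedding."""
    f = frames / np.linalg.norm(frames, axis=1, keepdims=True)
    s = steps / np.linalg.norm(steps, axis=1, keepdims=True)
    return f @ s.T  # shape (T frames, K sub-actions)


def dtw_cost(frames, steps):
    """Classic DTW over a (1 - cosine similarity) cost matrix.

    Assumption: the accumulated cost is divided by T + K so that videos
    and sub-action scripts of different lengths remain comparable.
    """
    cost = 1.0 - cosine_similarity_matrix(frames, steps)
    T, K = cost.shape
    acc = np.full((T + 1, K + 1), np.inf)
    acc[0, 0] = 0.0
    for t in range(1, T + 1):
        for k in range(1, K + 1):
            # Monotonic moves only: stay on a sub-action, stay on a frame,
            # or advance both. This is what preserves sub-action order.
            acc[t, k] = cost[t - 1, k - 1] + min(
                acc[t - 1, k], acc[t, k - 1], acc[t - 1, k - 1]
            )
    return acc[T, K] / (T + K)


def classify(frames, class_to_steps):
    """Pick the class whose ordered sub-action script aligns most cheaply."""
    scores = {name: dtw_cost(frames, steps) for name, steps in class_to_steps.items()}
    return min(scores, key=scores.get), scores


# Toy data (hypothetical): four frames that show "phase 1" then "phase 2".
toy_frames = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
toy_classes = {
    "phase1_then_phase2": np.array([[1.0, 0.0], [0.0, 1.0]]),
    "phase2_then_phase1": np.array([[0.0, 1.0], [1.0, 0.0]]),
}
prediction, toy_scores = classify(toy_frames, toy_classes)
print(prediction, toy_scores)
```

Both toy classes contain the same two sub-actions, so an order-agnostic average of frame-text similarities could not separate them; only the monotonic alignment constraint makes the correctly ordered script cheaper. That is the role temporal ordering plays in the formulation the abstract describes.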
