High-Speed Vision Improves Zero-Shot Semantic Understanding of Human Actions
Yongpeng Cao, Yuji Yamakawa

TL;DR
This paper demonstrates that higher temporal resolution in video significantly enhances zero-shot semantic understanding of rapid human actions, using a training-free approach with pretrained models.
Contribution
The paper introduces a training-free pipeline that combines pretrained video-language models with language-based reasoning to analyze high-speed actions, and it highlights the role of temporal resolution in that setting.
Findings
Higher frame rates improve semantic separability in zero-shot action recognition.
High-speed video yields more stable and interpretable semantic representations.
Temporal resolution is crucial for training-free understanding of rapid motions.
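One way to make "semantic separability" concrete is sketched below. It is a minimal illustration, not the authors' method: it assumes that a pretrained video-language model has already produced one embedding per clip and one per candidate action description, and it measures separability as the gap between the best and second-best cosine similarity. The function names (`subsample`, `zero_shot_scores`, `separability_margin`), the margin metric, and the toy vectors are all hypothetical. Lower frame rates are emulated by dropping frames from a high-speed recording.

```python
import math

def subsample(frames, src_fps, dst_fps):
    """Emulate a lower frame rate by keeping every k-th frame (k = src_fps // dst_fps)."""
    if dst_fps <= 0 or src_fps % dst_fps != 0:
        raise ValueError("dst_fps must be a positive divisor of src_fps")
    return frames[:: src_fps // dst_fps]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_scores(video_emb, label_embs):
    """Score each candidate label by similarity to the clip embedding (no training)."""
    return {label: cosine(video_emb, emb) for label, emb in label_embs.items()}

def separability_margin(scores):
    """Hypothetical separability measure: top-1 score minus top-2 score."""
    ranked = sorted(scores.values(), reverse=True)
    return ranked[0] - ranked[1]

# Toy stand-ins for embeddings; a real pipeline would obtain these from a pretrained model.
label_embs = {"action_a": [1.0, 0.0], "action_b": [0.0, 1.0]}
scores = zero_shot_scores([1.0, 0.0], label_embs)
margin = separability_margin(scores)

# Eight frame indices captured at 240 fps, reduced to 60 fps.
low_rate = subsample(list(range(8)), 240, 60)
```

Under these assumptions, comparing the margin for the same clip at several subsampled frame rates would give a rough indication of how much temporal resolution contributes to separating candidate actions.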
Abstract
Understanding human actions from visual observations is essential for human-robot interaction, particularly when semantic interpretation of unfamiliar or hard-to-annotate actions is required. In scenarios such as rapid and less common activities, collecting sufficient labeled data for supervised learning is challenging, making zero-shot approaches a practical alternative for semantic understanding without task-specific training. While recent advances in large-scale pretrained models enable such zero-shot reasoning, the impact of temporal resolution, especially for rapid and fine-grained motions, remains underexplored. In this study, we investigate how temporal resolution affects zero-shot semantic understanding of high-speed human actions. Using kendo as a representative case of rapid and subtle motion patterns, we propose a training-free pipeline that combines a pre-trained…
