High-Speed Vision Improves Zero-Shot Semantic Understanding of Human Actions

Yongpeng Cao; Yuji Yamakawa

arXiv:2605.00496·cs.CV·May 4, 2026

TL;DR

This paper demonstrates that higher temporal resolution in video significantly enhances zero-shot semantic understanding of rapid human actions, using a training-free approach with pretrained models.

Contribution

The paper introduces a training-free pipeline that combines pretrained video-language models with language reasoning to analyze high-speed actions, and it emphasizes the role of temporal resolution in that setting.

Findings

01. Higher frame rates improve semantic separability in zero-shot action recognition.

02. High-speed video yields more stable and interpretable semantic representations.

03. Temporal resolution is crucial for training-free understanding of rapid motions.
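
To make the notion of temporal resolution concrete, the following minimal sketch shows how a lower frame rate could be emulated by subsampling a high-speed recording, and how few frames then fall inside a short motion. It is an illustration only, not the authors' pipeline: the function names and the numbers (1000 fps, 30 fps, a 50 ms motion) are hypothetical and do not come from the paper.

```python
def subsample_indices(num_frames: int, src_fps: int, dst_fps: int) -> list[int]:
    """Return the source-frame indices kept when emulating dst_fps
    from a recording captured at src_fps (hypothetical helper)."""
    if src_fps <= 0 or dst_fps <= 0:
        raise ValueError("frame rates must be positive")
    if dst_fps > src_fps:
        raise ValueError("subsampling cannot raise the frame rate")
    indices = []
    i = 0
    while True:
        # Integer arithmetic avoids floating-point drift in the index.
        idx = i * src_fps // dst_fps
        if idx >= num_frames:
            break
        indices.append(idx)
        i += 1
    return indices


def frames_in_motion(duration_ms: int, fps: int) -> int:
    """Number of whole frame intervals that fit inside a motion
    lasting duration_ms milliseconds (hypothetical helper)."""
    return duration_ms * fps // 1000


# Assumed example values, chosen only for illustration.
kept = subsample_indices(num_frames=1000, src_fps=1000, dst_fps=30)
print(len(kept))                  # 30 frames kept from one second of footage
print(frames_in_motion(50, 1000)) # 50
print(frames_in_motion(50, 30))   # 1
```

Under these assumed values, a 50 ms motion is covered by 50 frames at 1000 fps but by only one full frame interval at 30 fps, which is the kind of gap the findings above attribute to temporal resolution.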

Abstract

Understanding human actions from visual observations is essential for human-robot interaction, particularly when semantic interpretation of unfamiliar or hard-to-annotate actions is required. In scenarios such as rapid and less common activities, collecting sufficient labeled data for supervised learning is challenging, making zero-shot approaches a practical alternative for semantic understanding without task-specific training. While recent advances in large-scale pretrained models enable such zero-shot reasoning, the impact of temporal resolution, especially for rapid and fine-grained motions, remains underexplored. In this study, we investigate how temporal resolution affects zero-shot semantic understanding of high-speed human actions. Using kendo as a representative case of rapid and subtle motion patterns, we propose a training-free pipeline that combines a pre-trained…
