Orthogonal Temporal Interpolation for Zero-Shot Video Recognition
Yan Zhu, Junbao Zhuo, Bin Ma, Jiajia Geng, Xiaoming Wei, Xiaolin Wei, Shuhui Wang

TL;DR
This paper introduces an orthogonal temporal interpolation method for zero-shot video recognition that improves the use of temporal features in vision-language models, leading to better recognition accuracy on unseen categories.
Contribution
It proposes a novel orthogonal temporal interpolation module and matching loss to enhance temporal feature learning in zero-shot video recognition, outperforming previous methods.
Findings
OTI (Orthogonal Temporal Interpolation) outperforms prior state-of-the-art methods on Kinetics-600, UCF101, and HMDB51
Orthogonal temporal features improve zero-shot recognition accuracy
Refined spatial-temporal features are more effective than spatial-only features
Abstract
Zero-shot video recognition (ZSVR) is a task that aims to recognize video categories that have not been seen during model training. Recently, vision-language models (VLMs) pre-trained on large-scale image-text pairs have demonstrated impressive transferability for ZSVR. To make VLMs applicable to the video domain, existing methods often add a temporal learning module after the image-level encoder to learn the temporal relationships among video frames. Unfortunately, for videos from unseen categories, we observe an abnormal phenomenon: the model that uses the spatial-temporal feature performs much worse than the model that removes the temporal learning module and uses only the spatial feature. We conjecture that improper temporal modeling disrupts the spatial feature of the video. To verify our hypothesis, we propose Feature Factorization to retain the orthogonal…
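As a rough illustration of the idea named in the abstract, the sketch below removes from a temporal feature the part that lies along the spatial feature, then blends the remaining orthogonal part back into the spatial feature. This is a minimal sketch and not the authors' implementation: the projection formula is the standard vector projection, and the function names, the toy vectors, and the blending weight `lam` are all assumptions introduced here for illustration.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthogonal_component(temporal, spatial):
    """Return the part of `temporal` that is orthogonal to `spatial`.

    Uses the standard projection: t_orth = t - (<t, s> / <s, s>) * s.
    (Hypothetical helper; not taken from the paper.)
    """
    denom = dot(spatial, spatial)
    if denom == 0.0:
        # A zero spatial vector has no direction to project onto.
        return list(temporal)
    scale = dot(temporal, spatial) / denom
    return [t - scale * s for t, s in zip(temporal, spatial)]


def interpolate(spatial, temporal, lam=0.5):
    """Blend the spatial feature with the orthogonal temporal part.

    `lam` is an assumed interpolation weight chosen for illustration.
    """
    orth = orthogonal_component(temporal, spatial)
    return [s + lam * o for s, o in zip(spatial, orth)]


# Toy 3-dimensional features (made-up values).
spatial = [1.0, 0.0, 0.0]
temporal = [0.6, 0.8, 0.0]

orth = orthogonal_component(temporal, spatial)
refined = interpolate(spatial, temporal, lam=0.5)
```

Because the added component is orthogonal to the spatial feature, the projection of `refined` onto `spatial` is unchanged in this sketch, which matches the motivation of not disturbing the spatial feature.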
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Cancer-related molecular mechanisms research
