Open-Vocabulary Temporal Action Detection with Off-the-Shelf Image-Text Features
Vivek Rathod, Bryan Seybold, Sudheendra Vijayanarasimhan, Austin Myers, Xiuye Gu, Vighnesh Birodkar, David A. Ross

TL;DR
This paper introduces a simple approach to open-vocabulary temporal action detection in videos that uses image-text co-embeddings pretrained on static images rather than videos, achieving performance competitive with fully supervised models.
Contribution
It demonstrates that image-text co-embeddings, trained on static images, can effectively enable open-vocabulary action detection in videos, and proposes a new evaluation setting based on category similarity.
Findings
Image-text co-embeddings enable competitive open-vocabulary detection.
Combining image-text features with motion or audio features improves performance.
A new evaluation setting based on category similarity is proposed.
Abstract
Detecting actions in untrimmed videos should not be limited to a small, closed set of classes. We present a simple yet effective strategy for open-vocabulary temporal action detection utilizing pretrained image-text co-embeddings. Despite being trained on static images rather than videos, we show that image-text co-embeddings enable open-vocabulary performance competitive with fully supervised models. We show that the performance can be further improved by ensembling the image-text features with features encoding local motion, like optical-flow-based features, or other modalities, like audio. In addition, we propose a more reasonable open-vocabulary evaluation setting for the ActivityNet data set, where the category splits are based on similarity rather than random assignment.
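To make the idea in the abstract concrete, here is a minimal sketch of how per-frame image embeddings could be matched against text embeddings of arbitrary class names. It is an illustration under stated assumptions, not the authors' detector: the embeddings are assumed to come from some pretrained image-text model, and the `cosine` and `detect_segments` helpers, the fixed `threshold`, and the merge-consecutive-frames step are all hypothetical choices made for this example.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def detect_segments(
    frame_embs: Sequence[Sequence[float]],
    class_embs: Dict[str, Sequence[float]],
    threshold: float = 0.5,
) -> List[Tuple[int, int, str, float]]:
    """Return (start_frame, end_frame, class_name, mean_score) segments.

    Assumption: frame_embs and class_embs live in the same image-text
    embedding space. The threshold-and-merge rule is a toy stand-in for
    a real temporal detector.
    """
    # Step 1: label each frame with its best-matching class, or None
    # when no class reaches the threshold (treated as background).
    labels: List[Tuple[Optional[str], float]] = []
    for frame in frame_embs:
        best_name, best_score = None, threshold
        for name, text_emb in class_embs.items():
            score = cosine(frame, text_emb)
            if score >= best_score:
                best_name, best_score = name, score
        labels.append((best_name, best_score))

    # Step 2: merge runs of consecutive frames that share a label.
    segments: List[Tuple[int, int, str, float]] = []
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i][0] != labels[start][0]:
            name = labels[start][0]
            if name is not None:
                scores = [s for _, s in labels[start:i]]
                segments.append((start, i - 1, name, sum(scores) / len(scores)))
            start = i
    return segments
```

Because the class set is just a dictionary of text embeddings, new categories can be added at inference time by embedding their names, which is the sense in which such a pipeline is open-vocabulary.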
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Anomaly Detection Techniques and Applications
