Exploiting VLM Localizability and Semantics for Open Vocabulary Action Detection
Wentao Bao, Kai Li, Yuxiao Chen, Deep Patel, Martin Renqiang Min, Yu Kong

TL;DR
This paper introduces OpenMixer, a novel method leveraging large vision-language models for open-vocabulary action detection in videos, enabling recognition and localization of both seen and unseen actions.
Contribution
The paper proposes OpenMixer, combining spatial and temporal blocks with dynamic alignment, to achieve open-vocabulary action detection using pre-trained VLMs within a query-based transformer framework.
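The open-vocabulary idea behind this contribution can be illustrated with a minimal sketch. It is not the OpenMixer architecture: it shows only the generic VLM-style step of scoring a per-person query feature against text embeddings of action names, so that adding a new class name at test time requires no retraining. The function names, the toy 3-dimensional embeddings, and the temperature of 0.07 are all assumptions made for illustration. In practice the embeddings would come from a pre-trained VLM's visual and text encoders.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify_open_vocab(query_feat, text_feats, class_names, temperature=0.07):
    """Score one query feature against text embeddings of class names.

    Hypothetical helper: cosine similarity divided by a temperature
    (0.07 is an assumed value), followed by a softmax over the classes.
    """
    q = l2_normalize(query_feat)
    logits = []
    for t in text_feats:
        t = l2_normalize(t)
        logits.append(sum(a * b for a, b in zip(q, t)) / temperature)
    return dict(zip(class_names, softmax(logits)))

# Toy embeddings (assumed); real ones would come from a VLM text encoder.
class_names = ["run", "jump", "wave"]
text_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query_feat = [0.1, 0.9, 0.2]  # stand-in for a detected person's feature

scores = classify_open_vocab(query_feat, text_feats, class_names)
best = max(scores, key=scores.get)

# An unseen class is handled by appending its text embedding; no retraining.
scores_ext = classify_open_vocab(
    query_feat, text_feats + [[0.1, 0.9, 0.3]], class_names + ["hop"]
)
```

The softmax here assumes a single label per person. Benchmarks where a person can perform several actions at once would more naturally use an independent sigmoid per class.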
Findings
OpenMixer outperforms baselines on detecting seen and unseen actions.
Establishes new OVAD benchmarks under various settings.
Demonstrates strong generalization from pre-trained VLMs.
Abstract
Action detection aims to detect (recognize and localize) human actions spatially and temporally in videos. Existing approaches focus on the closed-set setting where an action detector is trained and tested on videos from a fixed set of action categories. However, this constrained setting is not viable in an open world where test videos inevitably come beyond the trained action categories. In this paper, we address the practical yet challenging Open-Vocabulary Action Detection (OVAD) problem. It aims to detect any action in test videos while training a model on a fixed set of action categories. To achieve such an open-vocabulary capability, we propose a novel method OpenMixer that exploits the inherent semantics and localizability of large vision-language models (VLM) within the family of query-based detection transformers (DETR). Specifically, the OpenMixer is developed by spatial and…
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification
Methods: Attention Is All You Need · Label Smoothing · Adam · Residual Connection · Byte Pair Encoding · Linear Layer · Softmax · Position-Wise Feed-Forward Layer · Layer Normalization · Absolute Position Encodings
