Exploiting VLM Localizability and Semantics for Open Vocabulary Action Detection

Wentao Bao; Kai Li; Yuxiao Chen; Deep Patel; Martin Renqiang Min; Yu Kong

arXiv:2411.10922·cs.CV·November 19, 2024

TL;DR

This paper introduces OpenMixer, a novel method leveraging large vision-language models for open-vocabulary action detection in videos, enabling recognition and localization of both seen and unseen actions.

Contribution

The paper proposes OpenMixer, combining spatial and temporal blocks with dynamic alignment, to achieve open-vocabulary action detection using pre-trained VLMs within a query-based transformer framework.

Findings

01 OpenMixer outperforms baselines on detecting both seen and unseen actions.

02 The paper establishes new OVAD benchmarks under various settings.

03 The results demonstrate strong generalization inherited from pre-trained VLMs.

Abstract

Action detection aims to detect (recognize and localize) human actions spatially and temporally in videos. Existing approaches focus on the closed-set setting where an action detector is trained and tested on videos from a fixed set of action categories. However, this constrained setting is not viable in an open world where test videos inevitably come beyond the trained action categories. In this paper, we address the practical yet challenging Open-Vocabulary Action Detection (OVAD) problem. It aims to detect any action in test videos while training a model on a fixed set of action categories. To achieve such an open-vocabulary capability, we propose a novel method OpenMixer that exploits the inherent semantics and localizability of large vision-language models (VLM) within the family of query-based detection transformers (DETR). Specifically, the OpenMixer is developed by spatial and…
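
As a rough illustration of the open-vocabulary idea in the abstract (train on a fixed set of action categories, then score arbitrary action names at test time), the sketch below classifies one detected-person feature by cosine similarity against text embeddings of action names. It is not the OpenMixer implementation: the function names, the toy 3-dimensional embeddings, and the temperature value of 0.07 are all assumptions made for this example, and a real system would obtain both embeddings from a pre-trained VLM.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_scores(query_embedding, text_embeddings):
    """Cosine similarity between one visual feature and each action-name embedding."""
    q = l2_normalize(query_embedding)
    scores = {}
    for name, emb in text_embeddings.items():
        t = l2_normalize(emb)
        scores[name] = sum(a * b for a, b in zip(q, t))
    return scores


def classify_open_vocab(query_embedding, text_embeddings, temperature=0.07):
    """Softmax over temperature-scaled similarities; returns (best label, probabilities)."""
    scores = cosine_scores(query_embedding, text_embeddings)
    logits = {name: s / temperature for name, s in scores.items()}
    max_logit = max(logits.values())  # subtract the max for numerical stability
    exps = {name: math.exp(v - max_logit) for name, v in logits.items()}
    total = sum(exps.values())
    probs = {name: e / total for name, e in exps.items()}
    best = max(probs, key=probs.get)
    return best, probs


# Hypothetical toy embeddings. "swim" stands in for an action name that was
# never seen during training; it only needs a text embedding at test time.
text_embeddings = {
    "run": [1.0, 0.0, 0.0],
    "jump": [0.0, 1.0, 0.0],
    "swim": [0.0, 0.0, 1.0],
}
person_query = [0.1, 0.2, 0.9]  # hypothetical feature for one detected person

label, probs = classify_open_vocab(person_query, text_embeddings)
print(label)
```

Because the label set is just a dictionary of text embeddings, adding a new action at test time means adding one more entry, with no retraining of the classifier head. That property is what makes the vocabulary "open" in this simplified picture.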

Code & Models

Repositories

cogito2012/openmixer (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Text Readability and Simplification

Methods: Attention Is All You Need · Label Smoothing · Adam · Residual Connection · Byte Pair Encoding · Linear Layer · Softmax · Position-Wise Feed-Forward Layer · Layer Normalization · Absolute Position Encodings