Actor-agnostic Multi-label Action Recognition with Multi-modal Query

Anindya Mondal; Sauradip Nag; Joaquin M Prada; Xiatian Zhu; Anjan Dutta

arXiv:2307.10763 · cs.CV · January 11, 2024 · 1 citation

TL;DR

This paper introduces MSQNet, a transformer-based model that performs actor-agnostic multi-label action recognition by leveraging multi-modal data, eliminating the need for actor-specific pose estimation, and outperforming prior methods on multiple benchmarks.

Contribution

The paper presents a novel multi-modal semantic query network (MSQNet) that unifies actor types and multi-label action recognition without actor-specific design or pose estimation.

Findings

01. MSQNet outperforms prior actor-specific methods by up to 50% on benchmarks.

02. The approach handles both human and animal actions in multi-label settings.

03. MSQNet eliminates the need for actor pose estimation, which simplifies model design.

Abstract

Existing action recognition methods are typically actor-specific due to the intrinsic topological and apparent differences among the actors. This requires actor-specific pose estimation (e.g., humans vs. animals), leading to cumbersome model design complexity and high maintenance costs. Moreover, they often focus on learning the visual modality alone and single-label classification whilst neglecting other available information sources (e.g., class name text) and the concurrent occurrence of multiple actions. To overcome these limitations, we propose a new approach called 'actor-agnostic multi-modal multi-label action recognition,' which offers a unified solution for various types of actors, including humans and animals. We further formulate a novel Multi-modal Semantic Query Network (MSQNet) model in a transformer-based object detection framework (e.g., DETR), characterized by…
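The query-based, multi-label idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the per-class query vectors, the single-head attention pooling, the dot-product scoring, and the names `cross_attend`, `multi_label_scores`, and `predict` are all assumptions made for illustration. It only shows why independent per-class sigmoid scores, rather than a softmax across classes, allow several actions to be predicted for the same clip.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attend(query, tokens):
    """Single-head cross-attention: pool video tokens by similarity to one query."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, t) / scale for t in tokens])
    dim = len(tokens[0])
    return [sum(w * t[i] for w, t in zip(weights, tokens)) for i in range(dim)]

def multi_label_scores(class_queries, video_tokens):
    """One independent probability per class (hypothetical scoring rule)."""
    scores = {}
    for name, query in class_queries.items():
        pooled = cross_attend(query, video_tokens)
        # Sigmoid per class: scores are not forced to sum to 1.
        scores[name] = sigmoid(dot(pooled, query))
    return scores

def predict(class_queries, video_tokens, threshold=0.5):
    """Return every class whose score reaches the threshold, sorted by name."""
    scores = multi_label_scores(class_queries, video_tokens)
    return sorted(name for name, s in scores.items() if s >= threshold)

# Toy data (made up): one query per action class, two video tokens.
queries = {
    "running": [1.0, 0.0, 0.0],
    "eating": [0.0, 1.0, 0.0],
    "sleeping": [0.0, 0.0, 1.0],
}
tokens = [
    [2.0, 0.0, -2.0],
    [0.0, 2.0, -2.0],
]

print(predict(queries, tokens))  # ['eating', 'running']
```

In this toy clip both "running" and "eating" pass the threshold at once, which a single softmax over classes could not express. Nothing in the sketch depends on the actor type or on pose keypoints, which is the sense in which a query-based design can be actor-agnostic.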

Code & Models

Repositories

mondalanindya/msqnet (PyTorch, official)

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning

Methods: Linear Layer · Softmax · Multi-Head Attention · Dense Connections · Attention Is All You Need · Residual Connection · Layer Normalization · Vision Transformer · Focus