Actor-agnostic Multi-label Action Recognition with Multi-modal Query
Anindya Mondal, Sauradip Nag, Joaquin M Prada, Xiatian Zhu, Anjan Dutta

TL;DR
This paper introduces MSQNet, a transformer-based model that performs actor-agnostic multi-label action recognition by leveraging multi-modal data, eliminating the need for actor-specific pose estimation, and outperforming prior methods on multiple benchmarks.
Contribution
The paper presents a novel multi-modal semantic query network (MSQNet) that unifies actor types and multi-label action recognition without actor-specific design or pose estimation.
Findings
MSQNet outperforms prior actor-specific methods by up to 50% on benchmarks.
The approach effectively handles both human and animal actions in multi-label settings.
MSQNet eliminates the need for actor pose estimation, simplifying model design.
Abstract
Existing action recognition methods are typically actor-specific due to the intrinsic topological and apparent differences among the actors. This requires actor-specific pose estimation (e.g., humans vs. animals), leading to cumbersome model design complexity and high maintenance costs. Moreover, they often focus on learning the visual modality alone and single-label classification whilst neglecting other available information sources (e.g., class name text) and the concurrent occurrence of multiple actions. To overcome these limitations, we propose a new approach called 'actor-agnostic multi-modal multi-label action recognition,' which offers a unified solution for various types of actors, including humans and animals. We further formulate a novel Multi-modal Semantic Query Network (MSQNet) model in a transformer-based object detection framework (e.g., DETR), characterized by…
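The idea of querying video features with class-level semantic queries and scoring each action independently can be illustrated with a minimal sketch. This is not the authors' implementation: the random vectors stand in for real text and video encoders, the way the query mixes the two modalities (text vector plus mean frame feature) is an assumption made for illustration, and the names `cross_attend`, `multi_label_scores`, `class_queries`, and `head_weights` are hypothetical. What the sketch does show is single-head scaled dot-product attention followed by a per-class sigmoid, so several actions can be predicted at once.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attend(query, features):
    """Single-head scaled dot-product attention of one query over frame features."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, f) / scale for f in features])
    dim = len(features[0])
    # Weighted average of the frame features (a convex combination).
    return [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]


def multi_label_scores(class_queries, features, head_weights):
    """Return one independent probability per class (sigmoid, not softmax)."""
    dim = len(features[0])
    # Assumption: a video-level summary is the mean of the frame features.
    video_mean = [sum(f[i] for f in features) / len(features) for i in range(dim)]
    probs = {}
    for name, text_vec in class_queries.items():
        # Assumption: the multi-modal query is text vector + video summary.
        query = [t + v for t, v in zip(text_vec, video_mean)]
        attended = cross_attend(query, features)
        probs[name] = sigmoid(dot(attended, head_weights[name]))
    return probs


rng = random.Random(0)
dim = 8
# Hypothetical stand-ins: random vectors in place of real text/video encoders.
class_names = ["running", "eating", "jumping"]
class_queries = {c: [rng.gauss(0, 1) for _ in range(dim)] for c in class_names}
head_weights = {c: [rng.gauss(0, 1) for _ in range(dim)] for c in class_names}
frame_features = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(4)]

probs = multi_label_scores(class_queries, frame_features, head_weights)
# Each class is thresholded on its own, so zero, one, or several may fire.
predicted = [c for c, p in probs.items() if p > 0.5]
print(probs, predicted)
```

Because each class gets its own sigmoid, the probabilities need not sum to one, unlike a softmax over classes in single-label classification. Nothing in the sketch depends on the kind of actor in the video, which is the sense in which such a design is actor-agnostic.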
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
Methods: Linear Layer · Softmax · Multi-Head Attention · Dense Connections · Attention Is All You Need · Residual Connection · Layer Normalization · Vision Transformer · Focus
