3Mformer: Multi-order Multi-mode Transformer for Skeletal Action Recognition
Lei Wang, Piotr Koniusz

TL;DR
The paper introduces 3Mformer, a novel multi-order multi-mode transformer that models higher-order relationships in skeletal data for action recognition, outperforming existing GCN and transformer models.
Contribution
It proposes a hypergraph-based approach with a multi-order transformer architecture that captures complex joint dependencies for improved skeletal action recognition.
Findings
Achieves state-of-the-art accuracy on benchmark datasets.
Effectively models higher-order joint dependencies.
Outperforms GCN and transformer-based methods.
Abstract
Many skeletal action recognition models use GCNs to represent the human body by 3D body joints connected body parts. GCNs aggregate one- or few-hop graph neighbourhoods, and ignore the dependency between not linked body joints. We propose to form hypergraph to model hyper-edges between graph nodes (e.g., third- and fourth-order hyper-edges capture three and four nodes) which help capture higher-order motion patterns of groups of body joints. We split action sequences into temporal blocks, Higher-order Transformer (HoT) produces embeddings of each temporal block based on (i) the body joints, (ii) pairwise links of body joints and (iii) higher-order hyper-edges of skeleton body joints. We combine such HoT embeddings of hyper-edges of orders 1, ..., r by a novel Multi-order Multi-mode Transformer (3Mformer) with two modules whose order can be exchanged to achieve coupled-mode attention on…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsHuman Pose and Action Recognition · Context-Aware Activity Recognition Systems · Anomaly Detection Techniques and Applications
MethodsMulti-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Residual Connection · Label Smoothing · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Dropout · Dense Connections
