Skeleton-based Action Recognition through Contrasting Two-Stream Spatial-Temporal Networks
Chen Pang, Xuequan Lu, Lei Lyu

TL;DR
This paper introduces a novel parallel two-stream network combining GCN and Transformer for skeleton-based action recognition, utilizing contrastive learning and a cyclical focal loss to enhance feature representation and achieve state-of-the-art results.
Contribution
It proposes a contrastive GCN-Transformer network with parallel streams and a cyclical focal loss, improving action recognition by capturing diverse features and focusing on hard samples.
Findings
Achieves state-of-the-art accuracy on benchmark datasets.
Effectively captures both local topology and global joint relationships.
Enhances feature learning through a contrastive paradigm and a cyclical focal loss.
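To make the "focusing on hard samples" idea concrete, here is a minimal sketch of the standard focal loss for a single sample, -(1 - p_t)^gamma * log(p_t). It is not the paper's cyclical focal loss: the cyclical schedule is not described in this summary, so a fixed gamma is assumed. The function name and the toy probabilities are illustrative only.

```python
import math

def focal_loss(probs, target, gamma=2.0):
    """Standard focal loss for one sample: -(1 - p_t)^gamma * log(p_t).

    probs  : predicted class probabilities (assumed to sum to 1)
    target : index of the true class
    gamma  : focusing parameter; gamma = 0 reduces to cross-entropy.
             A fixed value is an assumption here; the paper's cyclical
             variant is not reproduced.
    """
    p_t = probs[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# Toy predictions (hypothetical values): the true class is index 0 in both.
easy_sample = [0.9, 0.05, 0.05]   # confidently correct
hard_sample = [0.2, 0.5, 0.3]     # wrong class favoured

easy_loss = focal_loss(easy_sample, target=0)
hard_loss = focal_loss(hard_sample, target=0)
easy_ce = focal_loss(easy_sample, target=0, gamma=0.0)
hard_ce = focal_loss(hard_sample, target=0, gamma=0.0)
```

With gamma greater than 0, the easy sample's loss is scaled down far more than the hard sample's, so the gap between the two is much wider than under plain cross-entropy.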
Abstract
For pursuing accurate skeleton-based action recognition, most prior methods use the strategy of combining Graph Convolution Networks (GCNs) with attention-based methods in a serial way. However, they regard the human skeleton as a complete graph, resulting in fewer variations between different actions (e.g., the connection between the elbow and head in the action "clapping hands"). For this, we propose a novel Contrastive GCN-Transformer Network (ConGT) which fuses the spatial and temporal modules in a parallel way. The ConGT involves two parallel streams: a Spatial-Temporal Graph Convolution stream (STG) and a Spatial-Temporal Transformer stream (STT). The STG is designed to obtain action representations maintaining the natural topology structure of the human skeleton. The STT is devised to acquire action representations containing the global relationships among joints. Since the action…
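As a rough illustration of contrasting two parallel streams, the sketch below computes an InfoNCE-style loss between per-sample embeddings from two encoders. This is an assumption made for illustration: the abstract is truncated before it describes the actual contrastive objective, so InfoNCE is swapped in as a common choice. The names `info_nce`, `stg_feats`, and `stt_feats`, the temperature value, and the toy vectors are all hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(stg_feats, stt_feats, temperature=0.1):
    """InfoNCE-style loss between two lists of embeddings.

    Row i of stg_feats and row i of stt_feats are treated as a positive
    pair (same action sequence seen by both streams); every other row of
    stt_feats serves as a negative for that anchor.
    """
    a = [normalize(v) for v in stg_feats]
    b = [normalize(v) for v in stt_feats]
    total = 0.0
    for i, anchor in enumerate(a):
        # Cosine similarity of the anchor to every STT embedding.
        logits = [sum(x * y for x, y in zip(anchor, other)) / temperature
                  for other in b]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)

# Toy embeddings for two samples (hypothetical values).
stg_feats = [[1.0, 0.0], [0.0, 1.0]]
stt_aligned = [[0.9, 0.1], [0.1, 0.9]]   # streams agree per sample
stt_swapped = [[0.1, 0.9], [0.9, 0.1]]   # streams disagree per sample

aligned_loss = info_nce(stg_feats, stt_aligned)
swapped_loss = info_nce(stg_feats, stt_swapped)
```

The loss is low when the two streams produce similar embeddings for the same sample and high when they do not, which is the behaviour a contrastive objective between STG and STT representations would need.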
Taxonomy
Topics: Human Pose and Action Recognition · Medical Imaging and Analysis · Gait Recognition and Analysis
Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Adam · Layer Normalization · Label Smoothing · Multi-Head Attention · Dense Connections · Residual Connection · Position-Wise Feed-Forward Layer
