LiGAR: LiDAR-Guided Hierarchical Transformer for Multi-Modal Group Activity Recognition

Naga Venkata Sai Raviteja Chappa; Khoa Luu

arXiv:2410.21108 · cs.CV · December 11, 2024

TL;DR

LiGAR introduces a hierarchical transformer that integrates LiDAR, visual, and textual data for multi-modal group activity recognition and reports state-of-the-art results on the JRDB-PAR, Volleyball, and NBA datasets.

Contribution

The paper presents a novel LiDAR-guided hierarchical transformer architecture that enhances multi-modal group activity recognition by capturing multi-scale spatial and semantic information.

Findings

01

Achieves up to a 10.6% improvement in F1-score on JRDB-PAR

02

Improves Mean Per Class Accuracy by 5.9% on the NBA dataset

03

Maintains high performance when LiDAR data are unavailable at inference

Abstract

Group Activity Recognition (GAR) remains challenging in computer vision due to the complex nature of multi-agent interactions. This paper introduces LiGAR, a LiDAR-Guided Hierarchical Transformer for Multi-Modal Group Activity Recognition. LiGAR uses LiDAR data as a structural backbone to guide the processing of visual and textual information, enabling robust handling of occlusions and complex spatial arrangements. The framework incorporates a Multi-Scale LiDAR Transformer, Cross-Modal Guided Attention, and an Adaptive Fusion Module to effectively integrate multi-modal data at different semantic levels. LiGAR's hierarchical architecture captures group activities at various granularities, from individual actions to scene-level dynamics. Extensive experiments on the JRDB-PAR, Volleyball, and NBA datasets demonstrate LiGAR's superior performance, achieving state-of-the-art results…
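The abstract names Cross-Modal Guided Attention and an Adaptive Fusion Module but does not give their equations. As a rough illustration only, the following sketch assumes that LiDAR tokens serve as keys and values for visual-token queries (so LiDAR "guides" the visual stream) and that fusion uses softmax-normalised per-modality gate scores. The function names, the identity projections (no learned weights), and the gating scheme are hypothetical and are not taken from the paper.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Hypothetical single-head scaled dot-product cross-attention.

    Each query row (e.g. a visual token) attends over key/value rows
    (e.g. LiDAR tokens). Learned Q/K/V projections are omitted (identity)
    to keep the sketch small.
    """
    d = len(keys[0])
    out = []
    for q in queries:
        # Scaled similarity between this query and every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Output is a convex combination of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def adaptive_fusion(streams: List[Matrix], gate_scores: List[float]) -> Matrix:
    """Hypothetical gated fusion: one scalar score per modality stream is
    softmax-normalised and used to take a weighted sum of equally shaped
    token matrices."""
    alphas = softmax(gate_scores)
    rows, cols = len(streams[0]), len(streams[0][0])
    return [[sum(a * s[i][j] for a, s in zip(alphas, streams)) for j in range(cols)]
            for i in range(rows)]


if __name__ == "__main__":
    visual = [[1.0, 0.0], [0.0, 1.0]]              # two toy visual tokens
    lidar = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three toy LiDAR tokens
    guided = cross_modal_attention(visual, lidar, lidar)
    fused = adaptive_fusion([visual, guided], [0.0, 0.0])
    print(fused)
```

With equal gate scores the fusion reduces to a plain average of the streams; in a trained model the gate scores would presumably be predicted from the features themselves, which is one way a fusion module could down-weight a missing or unreliable modality.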

Taxonomy

Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Context-Aware Activity Recognition Systems

Methods: Attention Is All You Need · Linear Layer · Dense Connections · Label Smoothing · Byte Pair Encoding · Layer Normalization · Residual Connection · Multi-Head Attention · Softmax · Adam