Leveraging Foundation Models for Multimodal Graph-Based Action Recognition
Fatemeh Ziaeetabar, Florentin Wörgötter

TL;DR
This paper presents a multimodal graph-based framework that combines foundation models such as VideoMAE and BERT with dynamic graph reasoning to improve fine-grained action recognition in videos.
Contribution
It introduces an adaptive, dynamic graph architecture integrating multimodal representations and a task-specific attention mechanism for enhanced action understanding.
Findings
Outperforms state-of-the-art baselines on benchmark datasets
Effectively captures spatiotemporal and semantic relationships
Demonstrates robustness and generalization in action recognition
Abstract
Foundation models have ushered in a new era for multimodal video understanding by enabling the extraction of rich spatiotemporal and semantic representations. In this work, we introduce a novel graph-based framework that integrates a vision-language foundation, leveraging VideoMAE for dynamic visual encoding and BERT for contextual textual embedding, to address the challenge of recognizing fine-grained bimanual manipulation actions. Departing from conventional static graph architectures, our approach constructs an adaptive multimodal graph where nodes represent frames, objects, and textual annotations, and edges encode spatial, temporal, and semantic relationships. These graph structures evolve dynamically based on learned interactions, allowing for flexible and context-aware reasoning. A task-specific attention mechanism within a Graph Attention Network further enhances this reasoning…
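The graph described in the abstract can be illustrated with a minimal sketch. It assumes a toy layout (frames linked temporally, objects linked spatially to their frame, text nodes linked semantically to every frame) and uses a standard single-head graph attention layer in NumPy. The function names, the edge layout, and the random features standing in for VideoMAE and BERT embeddings are all hypothetical; the paper's dynamic graph updates and task-specific attention are not reproduced here.

```python
import numpy as np


def build_graph(num_frames, objects_per_frame, num_text):
    """Build a toy multimodal graph (hypothetical layout, not the paper's).

    Nodes are ordered: frames, then objects grouped by frame, then text.
    Returns (node_types, edges); edges holds (src, dst, edge_type) tuples
    stored in both directions.
    """
    node_types = ["frame"] * num_frames
    edges = []
    # Temporal edges link consecutive frames.
    for t in range(num_frames - 1):
        edges.append((t, t + 1, "temporal"))
    # Spatial edges link each object to the frame it appears in.
    for t in range(num_frames):
        for _ in range(objects_per_frame):
            obj = len(node_types)
            node_types.append("object")
            edges.append((obj, t, "spatial"))
    # Semantic edges link every text node to every frame.
    for _ in range(num_text):
        txt = len(node_types)
        node_types.append("text")
        for t in range(num_frames):
            edges.append((txt, t, "semantic"))
    # Add the reverse of every edge so messages flow both ways.
    edges += [(d, s, k) for (s, d, k) in edges]
    return node_types, edges


def gat_layer(x, edges, w, a, slope=0.2):
    """Standard single-head graph attention with self-loops.

    x: (n, d_in) node features, w: (d_in, d_out), a: (2 * d_out,).
    Edge types are ignored here; a type-aware variant would need
    extra parameters per edge type.
    """
    n = x.shape[0]
    h = x @ w
    d_out = h.shape[1]
    adj = np.eye(n, dtype=bool)
    for s, d, _ in edges:
        adj[d, s] = True  # node d attends to neighbour s
    # score[i, j] = LeakyReLU(a[:d_out] . h_i + a[d_out:] . h_j)
    score = (h @ a[:d_out])[:, None] + (h @ a[d_out:])[None, :]
    score = np.where(score > 0, score, slope * score)
    # Mask non-neighbours, then apply a row-wise softmax.
    score = np.where(adj, score, -np.inf)
    score = score - score.max(axis=1, keepdims=True)
    alpha = np.exp(score)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    return alpha @ h, alpha


# Random features stand in for VideoMAE / BERT embeddings (assumption).
rng = np.random.default_rng(0)
node_types, edges = build_graph(num_frames=3, objects_per_frame=2, num_text=1)
x = rng.normal(size=(len(node_types), 8))
w = rng.normal(size=(8, 4))
a = rng.normal(size=(8,))
out, alpha = gat_layer(x, edges, w, a)
```

Each row of alpha holds one node's attention weights over itself and its neighbours. In the framework the abstract describes, the edge set itself would also change as interactions are learned, which this static sketch leaves out.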
Taxonomy
Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Context-Aware Activity Recognition Systems
Methods: Attention Is All You Need · Linear Warmup With Linear Decay · Softmax · Attention Dropout · WordPiece · Linear Layer · Residual Connection · Weight Decay · Dropout
