Leveraging Foundation Models for Multimodal Graph-Based Action Recognition

Fatemeh Ziaeetabar; Florentin Wörgötter

arXiv:2505.15192·cs.CV·October 8, 2025

TL;DR

This paper presents a multimodal graph-based framework that pairs VideoMAE visual encodings and BERT text embeddings with dynamic graph reasoning to improve fine-grained action recognition in videos.

Contribution

The paper introduces an adaptive, dynamic graph architecture that integrates multimodal representations with a task-specific attention mechanism inside a Graph Attention Network.

Findings

01. Outperforms state-of-the-art baselines on benchmark datasets

02. Effectively captures spatiotemporal and semantic relationships

03. Demonstrates robustness and generalization in action recognition

Abstract

Foundation models have ushered in a new era for multimodal video understanding by enabling the extraction of rich spatiotemporal and semantic representations. In this work, we introduce a novel graph-based framework that integrates a vision-language foundation, leveraging VideoMAE for dynamic visual encoding and BERT for contextual textual embedding, to address the challenge of recognizing fine-grained bimanual manipulation actions. Departing from conventional static graph architectures, our approach constructs an adaptive multimodal graph where nodes represent frames, objects, and textual annotations, and edges encode spatial, temporal, and semantic relationships. These graph structures evolve dynamically based on learned interactions, allowing for flexible and context-aware reasoning. A task-specific attention mechanism within a Graph Attention Network further enhances this reasoning…
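
The graph reasoning described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is a plain single-head graph attention update in the style of a standard Graph Attention Network, run on a hypothetical five-node graph with frame, object, and text nodes. The node ids, edge list, dimensions, and random feature vectors are all assumptions made for illustration. In the paper's setting the features would come from VideoMAE and BERT, and the edges would evolve during training rather than stay fixed as they do here.

```python
import math
import random

# Hypothetical toy graph; node ids and types are illustrative only.
NODES = {0: "frame", 1: "frame", 2: "object", 3: "object", 4: "text"}

# Typed edges. The type labels are carried for readability and are not
# used in the attention score in this simplified sketch.
EDGES = [
    (0, 1, "temporal"),  # consecutive frames
    (0, 2, "spatial"),   # object 2 appears in frame 0
    (1, 3, "spatial"),   # object 3 appears in frame 1
    (2, 3, "spatial"),   # the two objects are near each other
    (4, 2, "semantic"),  # text annotation refers to object 2
    (4, 3, "semantic"),  # text annotation refers to object 3
]


def neighbors(node, edges):
    """Return the undirected neighbors of `node`, including a self-loop."""
    found = {node}
    for u, v, _ in edges:
        if u == node:
            found.add(v)
        if v == node:
            found.add(u)
    return sorted(found)


def gat_layer(features, edges, weight, attn):
    """One single-head graph attention update over all nodes."""
    # Shared linear projection h_i = W x_i.
    projected = {
        n: [sum(w * x for w, x in zip(row, feat)) for row in weight]
        for n, feat in features.items()
    }
    updated, alphas_by_node = {}, {}
    for n in features:
        nbrs = neighbors(n, edges)
        scores = []
        for m in nbrs:
            pair = projected[n] + projected[m]  # concatenation [h_n || h_m]
            e = sum(a * h for a, h in zip(attn, pair))
            scores.append(e if e > 0 else 0.2 * e)  # LeakyReLU, slope 0.2
        # Numerically stable softmax over the neighborhood.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        alphas = [v / total for v in exps]
        # Attention-weighted sum of projected neighbor features.
        updated[n] = [
            sum(a * projected[m][k] for a, m in zip(alphas, nbrs))
            for k in range(len(weight))
        ]
        alphas_by_node[n] = dict(zip(nbrs, alphas))
    return updated, alphas_by_node


# Random placeholders standing in for foundation-model embeddings.
rng = random.Random(0)
IN_DIM, OUT_DIM = 4, 3
features = {n: [rng.uniform(-1, 1) for _ in range(IN_DIM)] for n in NODES}
weight = [[rng.uniform(-1, 1) for _ in range(IN_DIM)] for _ in range(OUT_DIM)]
attn = [rng.uniform(-1, 1) for _ in range(2 * OUT_DIM)]

updated, attention = gat_layer(features, EDGES, weight, attn)
```

After the call, `attention[n]` maps each neighbor of node `n` to its normalized weight, so inspecting `attention[4]` shows how strongly the text node attends to each object it is linked to. A task-specific variant, as the abstract mentions, would presumably change how the scores are computed; that detail is not given in the text available here.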

Taxonomy

Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Context-Aware Activity Recognition Systems

Methods: Attention Is All You Need · Linear Warmup With Linear Decay · Softmax · Attention Dropout · WordPiece · Linear Layer · Residual Connection · Weight Decay · Dropout