Multimodal Alignment with Cross-Attentive GRUs for Fine-Grained Video Understanding

Namho Kim; Junhwa Kim

arXiv:2507.03531·cs.CV·July 8, 2025

TL;DR

This paper introduces a multimodal framework that fuses video, image, and text representations using GRU-based sequence encoders and cross-modal attention for fine-grained video understanding. It outperforms unimodal baselines on violence detection (DVD) and valence-arousal estimation (Aff-Wild2).

Contribution

The framework combines GRU-based sequence encoders with cross-modal attention to fuse video, image, and text, and regularizes training with feature-level augmentation and autoencoding.

Findings

01 Outperforms unimodal baselines on violence detection and valence-arousal estimation

02 Cross-attention improves modality integration and robustness

03 Feature augmentation enhances model generalization
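
To make the idea of feature-level augmentation concrete, here is a minimal sketch in Python with NumPy. The listing does not say which augmentations the authors use, so the specific choice here (additive Gaussian noise plus random feature dropout) is an assumption, as are the function name `augment_features` and the default values of `noise_std` and `drop_prob`.

```python
import numpy as np

def augment_features(feats, rng, noise_std=0.05, drop_prob=0.1):
    """Perturb extracted features instead of raw frames or tokens.

    Hypothetical recipe (not taken from the paper): Gaussian noise
    followed by random feature dropout.
    """
    # Additive Gaussian noise on every feature dimension.
    noisy = feats + rng.normal(scale=noise_std, size=feats.shape)
    # Randomly zero out a fraction of the feature entries.
    keep = rng.random(feats.shape) >= drop_prob
    # Inverted-dropout scaling keeps the expected value unchanged.
    return noisy * keep / (1.0 - drop_prob)

rng = np.random.default_rng(0)
feats = np.ones((8, 16))           # 8 time steps, 16-dim features
aug = augment_features(feats, rng)
print(aug.shape)                   # (8, 16)
```

Perturbing features rather than inputs is cheap because the backbone that extracts them does not need to be rerun for each augmented sample.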

Abstract

Fine-grained video classification requires understanding complex spatio-temporal and semantic cues that often exceed the capacity of a single modality. In this paper, we propose a multimodal framework that fuses video, image, and text representations using GRU-based sequence encoders and cross-modal attention mechanisms. The model is trained with a classification or regression loss, depending on the task, and is further regularized through feature-level augmentation and autoencoding techniques. To evaluate the generality of our framework, we conduct experiments on two challenging benchmarks: the DVD dataset for real-world violence detection and the Aff-Wild2 dataset for valence-arousal estimation. Our results demonstrate that the proposed fusion strategy significantly outperforms unimodal baselines, with cross-attention and feature augmentation contributing notably to…
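
The fusion described in the abstract can be sketched roughly as follows, in Python with NumPy. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify which modality acts as the attention query, how outputs are pooled, or any dimensions, so the video-as-query design, mean pooling, random untrained weights, and all names (`GRUEncoder`, `cross_attention`, `fuse`) are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class GRUEncoder:
    """Single-layer GRU returning the hidden state at every time step."""

    def __init__(self, in_dim, hid_dim, rng):
        s = 1.0 / np.sqrt(hid_dim)
        # Index 0: update gate z, 1: reset gate r, 2: candidate state n.
        self.W = rng.uniform(-s, s, (3, in_dim, hid_dim))
        self.U = rng.uniform(-s, s, (3, hid_dim, hid_dim))
        self.b = np.zeros((3, hid_dim))
        self.hid_dim = hid_dim

    def __call__(self, x):  # x: (T, in_dim)
        h = np.zeros(self.hid_dim)
        states = []
        for x_t in x:
            z = sigmoid(x_t @ self.W[0] + h @ self.U[0] + self.b[0])
            r = sigmoid(x_t @ self.W[1] + h @ self.U[1] + self.b[1])
            n = np.tanh(x_t @ self.W[2] + (r * h) @ self.U[2] + self.b[2])
            h = (1.0 - z) * n + z * h  # blend candidate and previous state
            states.append(h)
        return np.stack(states)  # (T, hid_dim)

def cross_attention(query, context):
    """Scaled dot-product attention: query steps attend over context steps."""
    scores = query @ context.T / np.sqrt(query.shape[-1])  # (Tq, Tc)
    weights = softmax(scores, axis=-1)
    return weights @ context, weights

def fuse(video, image, text, encoders):
    """Encode each modality, let video attend to image and text, then pool."""
    hv, hi, ht = (enc(x) for enc, x in zip(encoders, (video, image, text)))
    v_from_i, _ = cross_attention(hv, hi)  # assumption: video is the query
    v_from_t, _ = cross_attention(hv, ht)
    # Assumption: mean-pool over time, then concatenate the three views.
    return np.concatenate([hv.mean(0), v_from_i.mean(0), v_from_t.mean(0)])

rng = np.random.default_rng(0)
hid = 16
encoders = [GRUEncoder(d, hid, rng) for d in (32, 24, 20)]
video = rng.normal(size=(10, 32))  # 10 clip-level feature vectors
image = rng.normal(size=(4, 24))   # 4 keyframe feature vectors
text = rng.normal(size=(6, 20))    # 6 token embeddings
fused = fuse(video, image, text, encoders)
print(fused.shape)  # (48,)
```

A task-specific head would then consume `fused`: a classification output for violence detection, or a two-value regression output for valence and arousal, matching the task-dependent loss mentioned in the abstract.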
