A Heterogeneous Two-Stream Framework for Video Action Recognition with Comparative Fusion Analysis
Md. Afzalur Rahaman, Tahmid Rahman

TL;DR
This paper introduces a heterogeneous two-stream framework for video action recognition that assigns each modality its own backbone and compares several fusion strategies, reporting 98.12% accuracy on UCF11 and 96.86% on UCF50.
Contribution
It proposes a novel dual-stream architecture with modality-specific backbones and a comprehensive fusion analysis, highlighting the importance of tailored fusion strategies for different dataset sizes.
Findings
Cross-attention fusion achieves 98.12% accuracy on UCF11.
Weighted fusion reaches 96.86% on UCF50.
Modality contributions vary with dataset complexity.
Abstract
Most two-stream action recognition networks apply the same convolutional backbone to both the RGB and optical flow streams, even though the two modalities have fundamentally different structural properties. Optical flow captures fine-grained motion patterns, while RGB frames carry rich appearance and scene context; treating them identically discards this distinction. We propose DualStreamHybrid, a heterogeneous two-stream architecture that assigns each stream a backbone suited to its input: a pretrained ViT-Tiny/16 for RGB frames, and a MobileNetV2 trained from scratch on a 20-channel stacked optical flow representation. A learned projection layer maps the two differently sized feature vectors to a common dimensionality before fusion, enabling the two streams to interact without forcing architectural symmetry. We design five fusion strategies within a unified framework - late…
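The projection-then-fusion step described in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation. The widths 192 and 1280 are the usual output sizes of the standard ViT-Tiny and MobileNetV2 models and may differ from the paper's configuration. The shared width `COMMON_DIM`, the random initialisation, and the softmax-weighted mixing rule are assumptions made only for illustration, as are all function names.

```python
import math
import random

RGB_DIM = 192     # usual ViT-Tiny embedding width (assumed to apply here)
FLOW_DIM = 1280   # usual MobileNetV2 final feature width (assumed to apply here)
COMMON_DIM = 128  # hypothetical shared width; not stated in the source


def make_linear(in_dim, out_dim, rng):
    """Create randomly initialised weights and a zero bias for a projection."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def linear(x, params):
    """Apply y = W x + b to a single feature vector."""
    weights, bias = params
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(values):
    """Numerically stable softmax over a short list of logits."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_fusion(rgb_feat, flow_feat, rgb_proj, flow_proj, logits):
    """Project both streams to COMMON_DIM, then mix them with softmax weights."""
    rgb_common = linear(rgb_feat, rgb_proj)
    flow_common = linear(flow_feat, flow_proj)
    w_rgb, w_flow = softmax(logits)
    return [w_rgb * r + w_flow * f for r, f in zip(rgb_common, flow_common)]


rng = random.Random(0)
rgb_proj = make_linear(RGB_DIM, COMMON_DIM, rng)
flow_proj = make_linear(FLOW_DIM, COMMON_DIM, rng)

# Stand-ins for backbone outputs; real features would come from the two networks.
rgb_feat = [rng.gauss(0, 1) for _ in range(RGB_DIM)]
flow_feat = [rng.gauss(0, 1) for _ in range(FLOW_DIM)]

fused = weighted_fusion(rgb_feat, flow_feat, rgb_proj, flow_proj, [0.0, 0.0])
print(len(fused))  # 128
```

In a trained model the projection weights and the two mixing logits would be learned jointly with the classifier. Equal logits, as used here, give each stream a weight of 0.5.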
