Blockwise Temporal-Spatial Pathway Network
SeulGi Hong, Min-Kook Choi

TL;DR
This paper introduces BTSNet, a 3D-CNN model for video action recognition that adaptively adjusts spatial and temporal receptive fields, improving performance and interpretability across multiple datasets.
Contribution
The paper presents a novel blockwise temporal-spatial pathway network that adaptively selects receptive fields and fuses attention-based features for enhanced action recognition.
Findings
Generalized well on the UCF-101, HMDB-51, SVW, and Epic-Kitchen datasets without pretraining.
Provided interpretable visualizations of spatiotemporal attention.
Demonstrated improved representation for 3D convolutional blocks.
Abstract
Algorithms for video action recognition should consider not only spatial information but also temporal relations, which remains challenging. We propose a 3D-CNN-based action recognition model, called the blockwise temporal-spatial pathway network (BTSNet), which can adjust its temporal and spatial receptive fields through multiple pathways. The design is inspired by an adaptive kernel selection-based model, an architecture for effective feature encoding that adaptively chooses spatial receptive fields for image recognition. Extending this approach to the temporal domain, our model extracts temporal and channel-wise attention and fuses information from various candidate operations. For evaluation, we tested the proposed model on the UCF-101, HMDB-51, SVW, and Epic-Kitchen datasets and showed that it generalized well without pretraining. BTSNet also provides interpretable…
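
The fusion step the abstract describes (channel-wise attention over several candidate pathways) can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation. It follows the general selective-kernel pattern of sum, pool, softmax across branches, then weighted sum. The function names and the per-pathway `scales` and `biases` are hypothetical stand-ins for the learned fully connected layers a real model would use.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_pathways(branches, scales, biases):
    """Fuse candidate pathways with channel-wise attention.

    branches[p][c] is a flat list of activations for pathway p and
    channel c (all T*H*W positions flattened into one list).
    scales[p] and biases[p] are hypothetical scalars that replace the
    learned layers of a real network (an assumption of this sketch).
    Returns (fused, attention), where attention[p][c] sums to 1 over p.
    """
    num_paths = len(branches)
    num_channels = len(branches[0])
    attention = [[0.0] * num_channels for _ in range(num_paths)]
    fused = []
    for c in range(num_channels):
        channel_maps = [branches[p][c] for p in range(num_paths)]
        # 1. Element-wise sum of the candidate pathways.
        summed = [sum(values) for values in zip(*channel_maps)]
        # 2. Global average pooling: one descriptor per channel.
        descriptor = sum(summed) / len(summed)
        # 3. One logit per pathway, then softmax across pathways.
        logits = [scales[p] * descriptor + biases[p] for p in range(num_paths)]
        weights = softmax(logits)
        for p in range(num_paths):
            attention[p][c] = weights[p]
        # 4. Attention-weighted sum of the pathway outputs.
        fused.append([
            sum(weights[p] * channel_maps[p][i] for p in range(num_paths))
            for i in range(len(summed))
        ])
    return fused, attention


# Two toy pathways (e.g. a short and a long temporal kernel), two channels each.
short_path = [[1.0, 2.0], [0.0, 4.0]]
long_path = [[3.0, 6.0], [2.0, 0.0]]
fused, attention = fuse_pathways(
    [short_path, long_path], scales=[0.0, 0.0], biases=[0.0, 0.0]
)
```

With all scales and biases set to zero, every pathway receives the same weight, so the fused output reduces to the plain average of the pathways. Nonzero values shift the weight toward one pathway per channel, which is the kind of per-channel attention that can be inspected to see which receptive field is preferred.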
Taxonomy
Topics: Human Pose and Action Recognition · Digital Imaging for Blood Diseases · Advanced Neural Network Applications
