# A Novel 3D Convolutional Neural Network-Based Deep Learning Model for Spatiotemporal Feature Mapping for Video Analysis: Feasibility Study for Gastrointestinal Endoscopic Video Classification

**Authors:** Mrinal Kanti Dhar, Mou Deb, Poonguzhali Elangovan, Keerthy Gopalakrishnan, Divyanshi Sood, Avneet Kaur, Charmy Parikh, Swetha Rapolu, Gianeshwaree Alias Rachna Panjwani, Rabiah Aslam Ansari, Naghmeh Asadimanesh, Shiva Sankari Karuppiah, Scott A. Helgeson, Venkata S. Akshintala, Shivaram P. Arunachalam

PMC · DOI: 10.3390/jimaging11070243 · 2025-07-18

## TL;DR

This paper introduces a 3D CNN-based deep learning model for medical video analysis and tests its feasibility on classifying upper versus lower gastrointestinal endoscopic videos, using both spatial and temporal features.

## Contribution

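The paper proposes a 3D CNN for spatiotemporal feature extraction in medical video classification, built around a 3D parallel spatial and channel squeeze-and-excitation module (P-scSE3D) and a new residual with parallel attention (RPA) block that combines P-scSE3D with a residual block.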

## Key findings

- The model achieved an average accuracy of 0.933 and an average F1-score of 0.935 in classifying upper versus lower GI endoscopic videos.
- The integration of P-scSE3D improved the F1-score by 7%.
- The model used (2 + 1)D convolution in place of full 3D convolution to reduce computational complexity.
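
To make the complexity argument in the last finding concrete, the sketch below counts weights for a full 3D kernel and for a (2 + 1)D factorization (a spatial 1 × k × k convolution followed by a temporal t × 1 × 1 convolution). The channel widths, kernel sizes, and the intermediate width `c_mid` are illustrative assumptions, not values taken from the paper, and biases are ignored.

```python
def conv3d_params(c_in, c_out, t, k):
    """Weights in a full t x k x k 3D convolution (bias ignored)."""
    return t * k * k * c_in * c_out


def conv2plus1d_params(c_in, c_out, t, k, c_mid):
    """Weights in a (2+1)D factorization (bias ignored):
    a 1 x k x k spatial convolution from c_in to c_mid channels,
    followed by a t x 1 x 1 temporal convolution from c_mid to c_out."""
    spatial = k * k * c_in * c_mid
    temporal = t * c_mid * c_out
    return spatial + temporal


# Assumed example sizes: 64 -> 64 channels, 3 x 3 x 3 kernel, c_mid = 64.
full = conv3d_params(64, 64, t=3, k=3)
factored = conv2plus1d_params(64, 64, t=3, k=3, c_mid=64)
print(full, factored, round(factored / full, 3))  # 110592 49152 0.444
```

The saving depends entirely on `c_mid`: a wider intermediate layer narrows the gap, and some (2 + 1)D designs deliberately pick `c_mid` so that the parameter count matches the full 3D kernel. Which choice this paper makes is not stated in the summary.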

## Abstract

Accurate analysis of medical videos remains a major challenge in deep learning (DL) due to the need for effective spatiotemporal feature mapping that captures both spatial detail and temporal dynamics. Despite advances in DL, most existing models in medical AI focus on static images, overlooking critical temporal cues present in video data. To bridge this gap, a novel DL-based framework is proposed for spatiotemporal feature extraction from medical video sequences. As a feasibility use case, this study focuses on gastrointestinal (GI) endoscopic video classification. A 3D convolutional neural network (CNN) is developed to classify upper and lower GI endoscopic videos using the hyperKvasir dataset, which contains 314 lower and 60 upper GI videos. To address data imbalance, 60 matched pairs of videos are randomly selected across 20 experimental runs. Videos are resized to 224 × 224, and the 3D CNN captures spatiotemporal information. A 3D version of the parallel spatial and channel squeeze-and-excitation (P-scSE) is implemented, and a new block called the residual with parallel attention (RPA) block is proposed by combining P-scSE3D with a residual block. To reduce computational complexity, a (2 + 1)D convolution is used in place of full 3D convolution. The model achieves an average accuracy of 0.933, precision of 0.932, recall of 0.944, F1-score of 0.935, and AUC of 0.933. It is also observed that the integration of P-scSE3D increased the F1-score by 7%. This preliminary work opens avenues for exploring various GI endoscopic video-based prospective studies.
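
The balancing and reporting protocol described in the abstract can be sketched as follows. This is one reading of it, under stated assumptions: each of the 20 runs draws a fresh random subset of 60 lower GI videos to pair with the 60 upper GI videos, and the reported metrics are means over runs. The video IDs, the seeds, the helper names, and the choice of label 1 as the positive class are all hypothetical; the abstract does not describe how each run is split into training and test data, so that step is left out.

```python
import random
from statistics import mean


def balanced_run(lower_ids, upper_ids, seed):
    """Undersample the larger class so both classes have equal size."""
    rng = random.Random(seed)  # per-run seed keeps each draw reproducible
    n = min(len(lower_ids), len(upper_ids))
    return rng.sample(lower_ids, n), rng.sample(upper_ids, n)


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def average_metrics(per_run):
    """Mean of each metric across a list of per-run metric dicts."""
    return {key: mean(m[key] for m in per_run) for key in per_run[0]}


# Hypothetical IDs standing in for the 314 lower and 60 upper GI videos.
lower_ids = [f"lower_{i:03d}" for i in range(314)]
upper_ids = [f"upper_{i:03d}" for i in range(60)]

# One balanced 60-vs-60 draw per run, 20 runs in total.
runs = [balanced_run(lower_ids, upper_ids, seed) for seed in range(20)]
```

In use, each run's predictions would be scored with `binary_metrics` and the 20 resulting dicts passed to `average_metrics`, which corresponds to how averaged figures such as the 0.933 accuracy would be obtained under this reading.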

## Full-text entities

- **Genes:** RPA1 (replication protein A1) [NCBI Gene 6117] {aka HSSB, MST075, PFBMFT6, REPA1, RF-A, RP-A}
- **Diseases:** interstitial abnormalities (MESH:D065167), pleural effusion (MESH:D010996), injury to (MESH:D014947), foot ulcer (MESH:D016523), AI (MESH:C538142), lung consolidation (MESH:D008171), tremors (MESH:D014202), fatalities (MESH:C565541), GI diseases (MESH:D005767), epileptic (MESH:D004827), PD (MESH:D010300), enteric neurologic disorders (MESH:D004751), XAI (MESH:C538243), DL (MESH:D007859), seizures (MESH:D012640), polyp (MESH:D011127), gastrointestinal, liver, and pancreatic diseases (MESH:D008107)
- **Chemicals:** Grad (-), P (MESH:D010758)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12295846/full.md

---
Source: https://tomesphere.com/paper/PMC12295846