# OM-VST: A video action recognition model based on optimized downsampling module combined with multi-scale feature fusion

**Authors:** Xiaozhong Geng, Cheng Chen, Ping Yu, Baijin Liu, Weixin Hu, Qipeng Liang, Xintong Zhang

PMC · DOI: 10.1371/journal.pone.0318884 · 2025-03-06

## TL;DR

This paper introduces OM-VST, a video action recognition model that improves accuracy and reduces training parameters through optimized downsampling and multi-scale feature fusion.

## Contribution

OM-VST extends the Video Swin Transformer (VST) with an optimized downsampling module and a multi-scale feature fusion module to improve video classification performance.

## Key findings

- OM-VST improves classification accuracy by 2.81% in comparison experiments against mainstream video classification models.
- The model reduces the number of parameters by 54.7%, which the authors associate with shorter training time and lower energy consumption.
- OM-VST outperforms VST, SlowFast, and TSM on a public dataset.
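
To make the 54.7% figure concrete, the arithmetic can be sketched as below. The baseline of 100 million parameters is a hypothetical round number chosen for illustration, not a figure from the paper, and the helper name `reduced_parameter_count` is likewise an assumption.

```python
def reduced_parameter_count(baseline_params: int, reduction_pct: float) -> int:
    """Return the parameter count left after a percentage reduction."""
    if not 0.0 <= reduction_pct <= 100.0:
        raise ValueError("reduction_pct must be between 0 and 100")
    return round(baseline_params * (1.0 - reduction_pct / 100.0))


# Hypothetical baseline size (NOT taken from the paper).
HYPOTHETICAL_BASELINE = 100_000_000

# 54.7% is the reduction reported in the abstract.
remaining = reduced_parameter_count(HYPOTHETICAL_BASELINE, 54.7)
print(remaining)
```

Under this assumption, a 54.7% reduction leaves 45.3% of the baseline parameters, so the model would be a little under half the original size.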

## Abstract

Video classification, an essential task in computer vision, aims to automatically identify and label video content. However, current mainstream video classification models face two significant challenges in practical applications. First, classification accuracy is limited, mainly because of the complexity and diversity of video data, including subtle differences between categories, background interference, and illumination variations. Second, the number of model parameters is too high, resulting in longer training time and increased energy consumption. To address these problems, we propose the OM-Video Swin Transformer (OM-VST) model. Built on the Video Swin Transformer (VST), the model adds a multi-scale feature fusion module and an optimized downsampling module to improve its ability to perceive and characterize feature information. To verify the performance of OM-VST, we conducted comparison experiments against mainstream video classification models, such as VST, SlowFast, and TSM, on a public dataset. The results show that the accuracy of OM-VST is improved by 2.81% while the number of parameters is reduced by 54.7%. These changes improve accuracy on video classification tasks while reducing the number of parameters to train.
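
The abstract does not spell out how the two modules are built, so the following is only a toy sketch of the general idea behind multi-scale feature fusion, not the authors' design. It assumes a 1-D feature sequence, average pooling as the downsampling step, nearest-neighbour upsampling, and element-wise summation as the fusion rule. All function names are hypothetical.

```python
from typing import List


def downsample(x: List[float], factor: int = 2) -> List[float]:
    """Average-pool a 1-D feature sequence with a non-overlapping window."""
    if factor < 1 or len(x) % factor != 0:
        raise ValueError("length must be divisible by factor")
    return [sum(x[i:i + factor]) / factor for i in range(0, len(x), factor)]


def upsample(x: List[float], factor: int) -> List[float]:
    """Nearest-neighbour upsampling: repeat each value `factor` times."""
    return [v for v in x for _ in range(factor)]


def fuse_multiscale(x: List[float], num_scales: int = 3) -> List[float]:
    """Build a pyramid by repeated 2x downsampling, then sum all scales
    after restoring each one to the original resolution."""
    pyramid = [x]
    for _ in range(num_scales - 1):
        pyramid.append(downsample(pyramid[-1]))

    fused = [0.0] * len(x)
    for level, feat in enumerate(pyramid):
        restored = upsample(feat, 2 ** level)  # level 0 is already full size
        fused = [a + b for a, b in zip(fused, restored)]
    return fused


# Toy input (assumption): four feature values at the finest scale.
features = [1.0, 3.0, 5.0, 7.0]
fused = fuse_multiscale(features, num_scales=3)
print(fused)  # [7.0, 9.0, 15.0, 17.0]
```

Each output position mixes its own fine-scale value with progressively coarser averages of its neighbourhood, which is the intuition behind combining features from several resolutions. A real video model would apply learned layers to spatio-temporal tensors rather than fixed pooling on a list.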

## Figures

50 figures with captions in the complete paper: https://tomesphere.com/paper/PMC11884693/full.md

---
Source: https://tomesphere.com/paper/PMC11884693