OM-VST: A video action recognition model based on optimized downsampling module combined with multi-scale feature fusion

Xiaozhong Geng; Cheng Chen; Ping Yu; Baijin Liu; Weixin Hu; Qipeng Liang; Xintong Zhang

PMC · DOI: 10.1371/journal.pone.0318884 · March 6, 2025


Open Access

TL;DR

This paper introduces OM-VST, a video action recognition model that improves accuracy and reduces training parameters through optimized downsampling and multi-scale feature fusion.

Contribution

The novel OM-VST model combines an optimized downsampling module with multi-scale feature fusion for better video classification performance.

Findings

1. OM-VST improves classification accuracy by 2.81% compared to existing models.

2. The model reduces training parameters by 54.7%, decreasing training time and energy consumption.

3. OM-VST outperforms VST, SlowFast, and TSM on a public dataset.

Abstract

Video classification, an essential task in computer vision, aims to automatically identify and label video content. However, current mainstream video classification models face two significant challenges in practical applications. First, classification accuracy is limited, mainly because of the complexity and diversity of video data, including subtle differences between categories, background interference, and illumination variations. Second, the number of training parameters is too high, resulting in longer training time and increased energy consumption. To address these problems, we propose the OM-Video Swin Transformer (OM-VST) model. Building on the Video Swin Transformer (VST), this model adds a multi-scale feature fusion module and an optimized downsampling module to improve the model’s ability to…



Taxonomy

Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Gait Recognition and Analysis