Pose Matters: Evaluating Vision Transformers and CNNs for Human Action Recognition on Small COCO Subsets

MingZe Tang; Madiha Kazi

arXiv:2506.11678·cs.CV·June 16, 2025

TL;DR

This paper compares Vision Transformers and CNNs for human action recognition on small COCO subsets, finding that transformers significantly outperform CNNs and highlighting their data efficiency and interpretability advantages.

Contribution

It demonstrates the superior performance of Vision Transformers over CNNs in small data regimes for human action recognition and emphasizes explainability methods for model diagnosis.

Findings

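01. The binary Vision Transformer (ViT) reached 90% mean test accuracy, versus approximately 35% for convolutional networks and approximately 62-64% for CLIP-based models.

02. The ViT localizes pose-specific regions, such as the lower limbs for walking or running.

03. SHAP and LeGrad heatmaps show that simpler feed-forward models often focus on background textures, which explains their errors.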

Abstract

This study explores human action recognition using a three-class subset of the COCO image corpus, benchmarking models from simple fully connected networks to transformer architectures. The binary Vision Transformer (ViT) achieved 90% mean test accuracy, significantly exceeding multiclass classifiers such as convolutional networks (approximately 35%) and CLIP-based models (approximately 62-64%). A one-way ANOVA (F = 61.37, p < 0.001) confirmed that these differences are statistically significant. Qualitative analysis with the SHAP explainer and LeGrad heatmaps indicated that the ViT localizes pose-specific regions (e.g., lower limbs for walking or running), while simpler feed-forward models often focus on background textures, explaining their errors. These findings emphasize the data efficiency of transformer representations and the importance of explainability techniques in diagnosing…
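
The one-way ANOVA mentioned in the abstract compares mean accuracy across model families by taking the ratio of between-group to within-group variance. The sketch below shows how such an F statistic can be computed in plain Python. The function name one_way_anova_f and the per-run accuracies are hypothetical, chosen only for illustration; they are not the paper's data, and the number of runs and groups the authors used is not stated here, so the output will not match the reported F = 61.37.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric samples.

    Hypothetical helper for illustration; not taken from the paper.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of each group mean around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of each observation around its own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical per-run test accuracies (assumed values, NOT the paper's results).
runs = {
    "cnn": [0.33, 0.35, 0.37],
    "clip": [0.62, 0.63, 0.64],
    "vit": [0.88, 0.90, 0.92],
}

f_stat, df_b, df_w = one_way_anova_f(list(runs.values()))
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F means the group means differ by much more than the run-to-run noise within each group. Turning F into a p-value additionally requires the F distribution with (df_between, df_within) degrees of freedom, which a statistics library would normally supply.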

Taxonomy

Topics: Neural Networks and Applications · Fuzzy Logic and Control Systems · Human Pose and Action Recognition