Pose Matters: Evaluating Vision Transformers and CNNs for Human Action Recognition on Small COCO Subsets
MingZe Tang, Madiha Kazi

TL;DR
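This paper compares vision transformers and CNNs for human action recognition on small COCO subsets. A binary Vision Transformer reaches 90% mean test accuracy, well above convolutional networks (approximately 35%), and the authors highlight the data efficiency of transformer representations and the value of explainability methods for diagnosing model errors.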
Contribution
It demonstrates the superior performance of Vision Transformers over CNNs in small data regimes for human action recognition and emphasizes explainability methods for model diagnosis.
Findings
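The binary Vision Transformer (ViT) achieved 90% mean test accuracy, outperforming convolutional networks (approximately 35%) and CLIP-based models (approximately 62-64%)
The ViT localizes pose-specific regions (e.g., lower limbs for walking or running)
SHAP and LeGrad heatmaps show that simpler feed-forward models often focus on background textures, which explains their errors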
Abstract
This study explores human action recognition using a three-class subset of the COCO image corpus, benchmarking models from simple fully connected networks to transformer architectures. The binary Vision Transformer (ViT) achieved 90% mean test accuracy, significantly exceeding multiclass classifiers such as convolutional networks (approximately 35%) and CLIP-based models (approximately 62-64%). A one-way ANOVA (F = 61.37, p < 0.001) confirmed these differences are statistically significant. Qualitative analysis with SHAP explainer and LeGrad heatmaps indicated that the ViT localizes pose-specific regions (e.g., lower limbs for walking or running), while simpler feed-forward models often focus on background textures, explaining their errors. These findings emphasize the data efficiency of transformer representations and the importance of explainability techniques in diagnosing…
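The abstract reports a one-way ANOVA across model families. As a rough illustration of how such an F statistic is computed, the sketch below implements the standard between-group over within-group variance ratio in plain Python. The function name `one_way_anova_f` and the per-run accuracies are hypothetical and chosen only for illustration; they are not the paper's data and will not reproduce F = 61.37.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of observation groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of observations around their group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical per-run test accuracies (assumed values, NOT from the paper).
runs = {
    "cnn": [0.33, 0.36, 0.35, 0.37, 0.34],
    "clip": [0.62, 0.64, 0.63, 0.61, 0.65],
    "vit_binary": [0.89, 0.91, 0.90, 0.92, 0.88],
}

f_stat, df_b, df_w = one_way_anova_f(list(runs.values()))
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F means the gaps between group means are large relative to run-to-run variation within each group. Turning F into a p-value additionally requires the F distribution with the two degrees of freedom returned here, which a statistics library would normally supply.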
Taxonomy
Topics: Neural Networks and Applications · Fuzzy Logic and Control Systems · Human Pose and Action Recognition
