SV3.3B: A Sports Video Understanding Model for Action Recognition
Sai Varun Kodathala, Yashwanth Reddy Vutukoori, Rakesh Vunnam

TL;DR
SV3.3B is a lightweight, efficient sports video understanding model that combines novel sampling and self-supervised learning to accurately recognize and describe athletic actions with high detail and precision.
Contribution
The paper introduces SV3.3B, a 3.3B-parameter model that integrates temporal motion difference sampling and self-supervised learning for on-device sports video analysis.
Findings
Outperforms larger models like GPT-4o in sports description accuracy
Achieves a 29.2% improvement in validation metrics over GPT-4o
Demonstrates high information density and action complexity recognition
Abstract
This paper addresses the challenge of automated sports video analysis, which has traditionally been limited by computationally intensive models that require server-side processing and lack fine-grained understanding of athletic movements. Current approaches struggle to capture the nuanced biomechanical transitions essential for meaningful sports analysis, often missing critical phases such as preparation, execution, and follow-through that occur within seconds. To address these limitations, we introduce SV3.3B, a lightweight 3.3B-parameter video understanding model that combines novel temporal motion difference sampling with self-supervised learning for efficient on-device deployment. Our approach employs a DWT-VGG16-LDA-based keyframe extraction mechanism that intelligently identifies the 16 most representative frames from sports sequences, followed by a V-DWT-JEPA2 encoder pretrained…
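To make the idea of motion-based keyframe selection concrete, here is a minimal sketch in Python. It is not the paper's DWT-VGG16-LDA pipeline: it swaps in a plain frame-difference score, and the function name `select_keyframes`, the segment-wise selection rule, and the array layout are all assumptions made for illustration. The only detail taken from the abstract is the budget of 16 frames per sequence.

```python
import numpy as np


def select_keyframes(frames: np.ndarray, k: int = 16) -> np.ndarray:
    """Pick up to k frame indices using a simple motion-difference score.

    Assumed input layout (hypothetical): (T, H, W) grayscale or (T, H, W, C) color.
    This is a simplified stand-in, not the DWT-VGG16-LDA method from the paper.
    """
    if frames.ndim not in (3, 4):
        raise ValueError("frames must have shape (T, H, W) or (T, H, W, C)")
    n = frames.shape[0]
    if n <= k:
        # Short clips: keep every frame.
        return np.arange(n)

    # Cast to float first so uint8 subtraction cannot wrap around.
    gray = frames.astype(np.float32)
    if gray.ndim == 4:
        gray = gray.mean(axis=-1)

    # Motion score of frame t = mean absolute difference to frame t-1.
    # Frame 0 has no predecessor, so it gets a score of 0.
    diffs = np.abs(np.diff(gray, axis=0)).mean(axis=(1, 2))
    scores = np.concatenate([np.zeros(1, dtype=np.float32), diffs])

    # Split the timeline into k contiguous segments and keep the
    # highest-motion frame of each, so early and late phases of the
    # action are all represented.
    segments = np.array_split(np.arange(n), k)
    picked = [int(seg[np.argmax(scores[seg])]) for seg in segments]
    return np.array(picked)
```

Calling `select_keyframes(video, k=16)` on a clip with more than 16 frames returns 16 increasing indices, one per temporal segment. Picking one frame per segment, rather than the 16 highest scores overall, is a design choice of this sketch: it keeps a single burst of motion from consuming the whole frame budget.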
Taxonomy
Topics: Human Pose and Action Recognition · Video Analysis and Summarization
