Stepping VLMs onto the Court: Benchmarking Spatial Intelligence in Sports
Yuchen Yang, Yuqing Shao, Duxiu Huang, Linfeng Dong, Yifei Liu, Suixin Tang, Xiang Zhou, Yuanyuan Gao, Wei Wang, Yue Zhou, Xue Yang, Yanfeng Wang, Xiao Sun, Zhihang Zhong

TL;DR
This paper introduces CourtSI, a large-scale sports-specific spatial intelligence dataset and benchmark for vision-language models (VLMs). It shows that current VLMs struggle with spatial reasoning in sports scenarios and that fine-tuning on CourtSI improves their accuracy.
Contribution
The paper presents the first large-scale sports-specific spatial intelligence dataset and benchmark, along with a semi-automatic scene reconstruction method and evaluation of VLMs.
Findings
Existing VLMs show a significant performance gap on CourtSI.
Fine-tuning improves accuracy by 23.5 percentage points.
Models generalize well to similar unseen sports.
Abstract
Sports have long attracted broad attention as they push the limits of human physical and cognitive capabilities. Amid growing interest in spatial intelligence for vision-language models (VLMs), sports provide a natural testbed for understanding high-intensity human motion and dynamic object interactions. To this end, we present CourtSI, the first large-scale spatial intelligence dataset tailored to sports scenarios. CourtSI contains over 1M QA pairs, organized under a holistic taxonomy that systematically covers spatial counting, distance measurement, localization, and relational reasoning, across representative net sports including badminton, tennis, and table tennis. Leveraging well-defined court geometry as metric anchors, we develop a semi-automatic data engine to reconstruct sports scenes, enabling scalable curation of CourtSI. In addition, we introduce CourtSI-Bench, a…
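To make the idea of court geometry as a metric anchor concrete, the following is a minimal sketch, not the authors' data engine. It assumes the four outer corners of a badminton doubles court (6.1 m by 13.4 m) have been located in an image, fits a planar homography with the direct linear transform, and uses it to turn two foot positions in pixels into a distance in metres. All pixel coordinates and the helper names `fit_homography` and `to_court` are hypothetical, chosen only for illustration.

```python
import numpy as np

# Standard badminton doubles court size in metres (width, length).
COURT_W, COURT_L = 6.1, 13.4


def fit_homography(src, dst):
    """Direct linear transform: find H such that dst ~ H @ src (>= 4 point pairs)."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The homography is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    h = vt[-1].reshape(3, 3)
    return h / h[2, 2]


def to_court(h, pt):
    """Map an image point (pixels) onto the court plane (metres)."""
    x, y, w = h @ np.array([pt[0], pt[1], 1.0])
    return np.array([x / w, y / w])


# Hypothetical pixel locations of the court corners (assumed, not from the paper):
# near-left, near-right, far-right, far-left.
image_corners = [(420, 650), (1500, 650), (1180, 300), (740, 300)]
court_corners = [(0.0, 0.0), (COURT_W, 0.0), (COURT_W, COURT_L), (0.0, COURT_L)]

H = fit_homography(image_corners, court_corners)

# Hypothetical foot positions of two players, in pixels.
player_a = to_court(H, (800, 600))
player_b = to_court(H, (1100, 350))
distance_m = float(np.linalg.norm(player_a - player_b))
print(f"Estimated distance between players: {distance_m:.2f} m")
```

This only recovers positions on the ground plane; off-ground objects such as the ball or a jumping player would need additional cues such as camera calibration or height estimates, which the sketch does not attempt.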
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Robotics and Sensor-Based Localization
