SportR: A Benchmark for Multimodal Large Language Model Reasoning in Sports
Haotian Xia, Haonan Ge, Junbo Zou, Hyun Woo Choi, Xuebin Zhang, Danny Suradja, Botao Rui, Ethan Tran, Wendy Jin, Zhen Ye, Xiyang Lin, Christopher Lai, Shengjie Zhang, Junwen Miao, Shichao Chen, Rhys Tracy, Vicente Ordonez, Weining Shen, Hanjie Chen

TL;DR
SportR is a comprehensive multimodal sports reasoning benchmark with diverse data and complex questions, designed to evaluate and advance large language models' understanding of sports through visual perception and rule-based reasoning.
Contribution
Introduces SportR, the first multi-sports benchmark with detailed reasoning chains, visual grounding, and multi-modal data to evaluate and improve multimodal large language models.
Findings
State-of-the-art models perform poorly on complex reasoning tasks.
Training improves performance but still leaves a significant gap.
Benchmark reveals the need for advanced multimodal reasoning capabilities.
Abstract
Deeply understanding sports requires an intricate blend of fine-grained visual perception and rule-based reasoning - a challenge that pushes the limits of current multimodal models. To succeed, models must master three critical capabilities: perceiving nuanced visual details, applying abstract sport rule knowledge, and grounding that knowledge in specific visual evidence. Current sports benchmarks either cover single sports or lack the detailed reasoning chains and precise visual grounding needed to robustly evaluate these core capabilities in a multi-sport context. To address this gap, we introduce SportR, the first multi-sports large-scale benchmark designed to train and evaluate MLLMs on the fundamental reasoning required for sports intelligence. Our benchmark provides a dataset of 4,789 images and 2,052 videos. To enable granular evaluation, we structure our benchmark around a…
Peer Reviews
Decision·ICLR 2026 Poster
- The paper clearly articulates the gap between current sports benchmarks - either single-sport with detailed annotations OR multi-sport without fine-grained reasoning chains. SportR addresses both limitations simultaneously.
- Fully human-authored CoT by 16 domain experts (including 2 NCAA Division I athletes) with rigorous quality control is a significant strength.
- The coordinate-based grounding evaluation (Q5) is a novel visual grounding task.
- Missing human evaluation: No human agreement studies or inter-annotator reliability metrics for the evaluation itself.
- Visual grounding (Q5) applies only to SportsImage, not SportsVideo, and there are no temporal grounding annotations at all.
- The evaluation uses a closed-source LLM as judge; the paper acknowledges self-preference bias but doesn't adequately address it.
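For readers unfamiliar with coordinate-based grounding, the sketch below shows one common way a predicted bounding box can be scored against a reference box: intersection over union (IoU) with a hit threshold. This is an illustration only. The summary above does not state which metric SportR uses for Q5, so the `Box` type, the `iou` and `grounding_accuracy` helpers, and the 0.5 threshold are all assumptions made for this example.

```python
from typing import NamedTuple, Sequence


class Box(NamedTuple):
    """Axis-aligned box as (x1, y1) top-left and (x2, y2) bottom-right. Hypothetical type."""
    x1: float
    y1: float
    x2: float
    y2: float


def area(b: Box) -> float:
    # Clamp at zero so a malformed box (x2 < x1 or y2 < y1) contributes no area.
    return max(0.0, b.x2 - b.x1) * max(0.0, b.y2 - b.y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes; 0.0 when they do not overlap."""
    inter_w = max(0.0, min(a.x2, b.x2) - max(a.x1, b.x1))
    inter_h = max(0.0, min(a.y2, b.y2) - max(a.y1, b.y1))
    inter = inter_w * inter_h
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(preds: Sequence[Box], golds: Sequence[Box],
                       threshold: float = 0.5) -> float:
    """Fraction of predictions whose IoU with the reference box meets the threshold.

    The 0.5 default is a widely used convention, assumed here, not taken from the paper.
    """
    if not golds:
        return 0.0
    hits = sum(iou(p, g) >= threshold for p, g in zip(preds, golds))
    return hits / len(golds)


if __name__ == "__main__":
    preds = [Box(0, 0, 2, 2), Box(0, 0, 1, 1)]
    golds = [Box(0, 0, 2, 2), Box(5, 5, 6, 6)]
    print(grounding_accuracy(preds, golds))
```

A thresholded metric like this treats a near miss and a wildly wrong box the same way, which is one reason a reviewer might ask whether low grounding accuracy reflects poor reasoning or merely imprecise localization.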
1. Human-authored CoT rationales: The manual, expert-driven CoT annotations enhance the dataset’s reliability and interpretability, avoiding the noise of model-generated explanations. The listed procedures and training for annotators are rigorous and well justified.
2. Comprehensive multimodal design: Covering both image and video modalities allows for evaluation of both spatial and temporal reasoning.
3. Hierarchical QA framework: The progressive question design (from simple classification to r
1. Limited ablation and error analysis: The paper would benefit from deeper analysis of failure cases or qualitative examples that illustrate where models still fall short and potential reasons why. The authors successfully show that their tasks are difficult, but they don't provide insights into why the tasks are difficult for current MLLMs.
2. True difficulty of visual localization: For the visual grounding tasks that require predicting bounding box coordinates, it is unclear whether the low acc
- The paper features a progressive hierarchy of questions that systematically test reasoning depth, from simple identification to complex, multi-step tasks like penalty prediction.
- The benchmark consists of 7,118 human-authored Chain-of-Thought (CoT) annotations for the most complex tasks, providing models with explicit examples of the required reasoning process.
- Extensive experiments show that state-of-the-art models perform poorly on SportR's most difficult tasks.
- The authors state that the paper is "the first large-scale, multi-sport benchmark specifically designed to evaluate core reasoning capabilities." However, there are existing sports-domain datasets and benchmarks, such as "Sportsu". What are the key differences between the dataset proposed in this paper and the existing datasets? Is it merely the addition of Chain-of-Thought (CoT) and grounding annotations?
- As shown in Table 1 and Table 2, after SFT and S
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis
