DeepSport: A Multimodal Large Language Model for Comprehensive Sports Video Reasoning via Agentic Reinforcement Learning

Junbo Zou; Haotian Xia; Zhen Ye; Shengjie Zhang; Christopher Lai; Vicente Ordonez; Weining Shen; Hanjie Chen

arXiv:2511.12908 · cs.CV · March 13, 2026

Open Access

TL;DR

DeepSport is an end-to-end multimodal large language model designed for multi-sport video understanding, employing active reasoning and reinforcement learning to outperform existing models on diverse sports video tasks.

Contribution

The paper introduces an end-to-end training framework that pairs a curated 78k-sample dataset with agentic reinforcement learning, enabling multi-task, multi-sport video reasoning.

Findings

01. Achieves state-of-the-art performance on a 6.7k benchmark.

02. Outperforms proprietary and open-source models with fewer frames.

03. Exhibits strong zero-shot transfer to unseen sports.

Abstract

Sports video understanding requires perceiving high-speed dynamics, complex rules, and long temporal contexts. Yet, current Multimodal Large Language Models (MLLMs) remain narrowly focused on single sports, specific tasks, or training-free paradigms. We introduce DeepSport, the first end-to-end trained MLLM for multi-task, multi-sport video understanding. DeepSport shifts from passive frame processing to active, iterative reasoning, dynamically extracting frames to "think with videos." To train our model, we curate a unified 78k-sample dataset via a rigorous three-step text-and-vision distillation pipeline. We then employ a progressive two-stage training strategy: a Sports Curriculum Supervised Fine-Tuning phase to build foundational perception, followed by Agentic Reinforcement Learning with a novel tool-use reward. Extensive experiments on a comprehensive 6.7k benchmark demonstrate…
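
The "active, iterative reasoning" idea in the abstract can be pictured as a loop in which the model either asks for more frames from a time window or commits to an answer. The sketch below is only an illustration under stated assumptions: `RequestFrames`, `Answer`, `run_episode`, `toy_policy`, and the uniform sampling rule are hypothetical names and choices made here, not the paper's interface, and the policy is a hand-written stand-in for the model. It tracks timestamps rather than decoding real video.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple, Union


# Hypothetical action types; the names are illustrative, not from the paper.
@dataclass
class RequestFrames:
    start_s: float
    end_s: float
    num_frames: int


@dataclass
class Answer:
    text: str


Action = Union[RequestFrames, Answer]
Policy = Callable[[List[float], float], Action]


def sample_timestamps(start_s: float, end_s: float, num_frames: int) -> List[float]:
    """Return uniformly spaced timestamps covering [start_s, end_s]."""
    if num_frames <= 1:
        return [(start_s + end_s) / 2.0]
    step = (end_s - start_s) / (num_frames - 1)
    return [start_s + i * step for i in range(num_frames)]


def run_episode(policy: Policy, duration_s: float, max_turns: int = 4) -> Tuple[str, List[float]]:
    """Alternate policy decisions and frame extraction until an answer or the turn budget."""
    seen: List[float] = []  # timestamps of every frame shown to the policy so far
    for _ in range(max_turns):
        action = policy(seen, duration_s)
        if isinstance(action, Answer):
            return action.text, seen
        # Clamp the requested window to the video bounds before sampling.
        start = max(0.0, action.start_s)
        end = min(duration_s, action.end_s)
        seen.extend(sample_timestamps(start, end, action.num_frames))
    return "", seen  # turn budget exhausted without an answer


def toy_policy(seen: List[float], duration_s: float) -> Action:
    """Stand-in for the model: coarse overview first, then a closer look, then answer."""
    if not seen:
        return RequestFrames(0.0, duration_s, 4)
    if len(seen) == 4:
        return RequestFrames(duration_s / 2.0, duration_s, 4)
    return Answer(f"answered after viewing {len(seen)} frames")
```

Calling `run_episode(toy_policy, 10.0)` runs three turns: a coarse pass over the whole clip, a denser pass over the second half, and then an answer. In a real system the policy would be the trained model, and a reward on its frame requests would shape when and where it looks.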

Taxonomy

Topics: Human Pose and Action Recognition · Video Analysis and Summarization · Advanced Technologies in Various Fields