Large VLM-based Stylized Sports Captioning

Sauptik Dhar; Nicholas Buoncristiani; Joe Anakata; Haoyu Zhang; Michelle Munson

arXiv:2508.19295·cs.CV·August 28, 2025

TL;DR

This paper introduces a two-level fine-tuned LVLM pipeline that significantly improves the accuracy and style of sports image captions, demonstrating practical real-time application in live sports journalism.

Contribution

It presents a novel two-level fine-tuning approach for LVLMs to generate accurate, stylized sports captions, addressing limitations of existing models.

Findings

01

8-10% improvement in F1 score

02

2-10% enhancement in BERT score

03

Real-time captioning at 6 images per 3-5 seconds
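
As a rough reading of the third finding, the reported batch rate can be converted to images per second. The sketch below is only arithmetic on the stated figures (6 images, 3 to 5 seconds); the function name `throughput_range` is hypothetical and is not taken from the paper.

```python
def throughput_range(images: int, min_seconds: float, max_seconds: float) -> tuple[float, float]:
    """Return (slowest, fastest) throughput in images per second.

    Assumes the whole batch of `images` finishes somewhere between
    `min_seconds` and `max_seconds`.
    """
    if images <= 0 or min_seconds <= 0 or max_seconds < min_seconds:
        raise ValueError("expected images > 0 and 0 < min_seconds <= max_seconds")
    # The longest batch time gives the slowest rate, the shortest gives the fastest.
    return images / max_seconds, images / min_seconds


# Figures quoted in the findings: 6 images per 3-5 seconds.
low, high = throughput_range(6, 3.0, 5.0)
print(f"{low:.1f} to {high:.1f} images per second")  # prints "1.2 to 2.0 images per second"
```

Under this reading, the pipeline handles roughly 1.2 to 2 images per second, or about 0.5 to 0.83 seconds per image.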

Abstract

The advent of large (visual) language models (LLM / LVLM) has led to a deluge of automated human-like systems in several domains, including social media content generation, search and recommendation, healthcare prognosis, and AI assistants for cognitive tasks. Although these systems have been successfully integrated into production, very little focus has been placed on sports, particularly on accurate identification and natural language description of the game play. Most existing LLMs/LVLMs can explain generic sports activities, but lack sufficient domain-centric sports jargon to create natural (human-like) descriptions. This work highlights the limitations of existing SoTA LLMs/LVLMs for generating production-grade sports captions from images in a desired stylized format, and proposes a two-level fine-tuned LVLM pipeline to address them. The proposed pipeline yields an improvement > 8-10% in…
