Large VLM-based Stylized Sports Captioning
Sauptik Dhar, Nicholas Buoncristiani, Joe Anakata, Haoyu Zhang, Michelle Munson

TL;DR
This paper introduces a two-level fine-tuned LVLM pipeline that significantly improves the accuracy and style of sports image captions, demonstrating practical real-time application in live sports journalism.
Contribution
It presents a novel two-level fine-tuning approach for LVLMs to generate accurate, stylized sports captions, addressing limitations of existing models.
Findings
8-10% improvement in F1 score
2-10% enhancement in BERT score
Real-time captioning at 6 images per 3-5 seconds
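As a rough illustration of the throughput figure above, the sketch below converts "6 images per 3-5 seconds" into images per second and average seconds per image. The function names are hypothetical, and it assumes the six images are processed as one batch whose total time falls in the 3-5 second window; the source does not say how the timing was measured.

```python
def throughput_range(images: int, sec_low: float, sec_high: float) -> tuple[float, float]:
    """Return (slowest, fastest) images per second for a batch finishing in [sec_low, sec_high]."""
    return images / sec_high, images / sec_low


def latency_range(images: int, sec_low: float, sec_high: float) -> tuple[float, float]:
    """Return (best, worst) average seconds per image for the same batch."""
    return sec_low / images, sec_high / images


# Figures taken from the Findings list: 6 images in 3-5 seconds.
slowest, fastest = throughput_range(6, 3.0, 5.0)
best, worst = latency_range(6, 3.0, 5.0)

print(f"throughput: {slowest:.1f} to {fastest:.1f} images/s")
print(f"average latency: {best:.2f} to {worst:.2f} s/image")
```

Under these assumptions the pipeline handles between 1.2 and 2 images per second, or roughly 0.5 to 0.83 seconds per image on average.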
Abstract
The advent of large (visual) language models (LLM / LVLM) has led to a deluge of automated human-like systems in several domains, including social media content generation, search and recommendation, healthcare prognosis, and AI assistants for cognitive tasks. Although these systems have been successfully integrated into production, very little focus has been placed on sports, particularly on accurate identification and natural language description of the game play. Most existing LLM/LVLMs can explain generic sports activities, but lack sufficient domain-centric sports jargon to create natural (human-like) descriptions. This work highlights the limitations of existing SoTA LLM/LVLMs for generating production-grade sports captions from images in a desired stylized format, and proposes a two-level fine-tuned LVLM pipeline to address them. The proposed pipeline yields an improvement > 8-10% in…
