OnlySportsLM: Optimizing Sports-Domain Language Models with SOTA Performance under Billion Parameters
Zexin Chen, Chengxi Li, Xiangyu Xie, Parijat Dube

TL;DR
This paper introduces OnlySportsLM, a small sports-domain language model trained on a large sports corpus, and shows that domain-specific training lets it match the performance of considerably larger models on sports tasks.
Contribution
The paper trains a small, sports-specific language model on a large dataset, optimizes the model architecture for the domain, and establishes a complete workflow (dataset, model, benchmark) for domain-specific language model development.
Findings
OnlySportsLM improves accuracy by 37.62% and 34.08% over the previous 135M and 360M state-of-the-art models, respectively.
The model matches larger models such as SmolLM 1.7B and Qwen 1.5B on sports tasks.
A new dataset and benchmark facilitate efficient sports-domain language modeling.
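The summary does not say whether the reported accuracy improvements are absolute (percentage points) or relative to the baseline. The sketch below shows how the two readings differ, using made-up accuracies that are not taken from the paper; the function names are likewise only illustrative.

```python
def absolute_gain(baseline_acc: float, new_acc: float) -> float:
    """Improvement in percentage points (new minus baseline)."""
    return new_acc - baseline_acc


def relative_gain(baseline_acc: float, new_acc: float) -> float:
    """Improvement as a percentage of the baseline accuracy."""
    return (new_acc - baseline_acc) / baseline_acc * 100.0


# Hypothetical accuracies in percent, chosen only for illustration.
baseline, new = 40.0, 55.0
print(f"absolute: {absolute_gain(baseline, new):.2f} points")
print(f"relative: {relative_gain(baseline, new):.2f} %")
```

With these illustrative numbers the same result reads as either a 15-point or a 37.5% improvement, so it is worth checking the paper's tables to see which convention the headline figures follow.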
Abstract
This paper explores the potential of a small, domain-specific language model trained exclusively on sports-related data. We investigate whether extensive training data combined with a specially designed small model structure can overcome model size constraints. The study introduces the OnlySports collection, comprising OnlySportsLM, the OnlySports Dataset, and the OnlySports Benchmark. Our approach involves: 1) creating a massive 600-billion-token OnlySports Dataset from FineWeb, 2) optimizing the RWKV architecture for sports-related tasks, resulting in a 196M-parameter model with a 20-layer, 640-dimension structure, 3) training OnlySportsLM on part of the OnlySports Dataset, and 4) testing the resulting model on the OnlySports Benchmark. OnlySportsLM achieves a 37.62%/34.08% accuracy improvement over previous 135M/360M state-of-the-art models and matches the performance of larger models such as SmolLM 1.7B…
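The abstract says the OnlySports Dataset was derived from FineWeb but, in this excerpt, does not describe how sports documents were selected. As a rough illustration only, the sketch below shows one simple way a domain subset could be carved out of web records: matching keywords in each record's URL. The keyword list, the function names, and the sample records are all hypothetical, and the records are assumed to be dictionaries with `url` and `text` fields. This is not the authors' pipeline.

```python
from urllib.parse import urlparse

# Hypothetical keyword list; a real pipeline would need a much broader
# and more carefully curated vocabulary.
SPORTS_KEYWORDS = ("nba", "nfl", "soccer", "football", "tennis", "espn")


def is_sports_url(url: str, keywords=SPORTS_KEYWORDS) -> bool:
    """Return True if any keyword occurs in the URL's host or path."""
    parsed = urlparse(url.lower())
    haystack = parsed.netloc + parsed.path
    # Plain substring matching: cheap, but prone to false positives.
    return any(keyword in haystack for keyword in keywords)


def filter_sports(records):
    """Yield only the records whose URL looks sports-related."""
    for record in records:
        if is_sports_url(record.get("url", "")):
            yield record


# Made-up sample records for demonstration.
sample = [
    {"url": "https://www.espn.com/nba/story/123", "text": "..."},
    {"url": "https://example.com/recipes/pasta", "text": "..."},
]
kept = list(filter_sports(sample))
print(len(kept), "of", len(sample), "records kept")
```

A URL filter like this is fast enough to run over a very large corpus, but it misses sports text on general-purpose sites; a content-based classifier is a common complement in practice.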
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Software Testing and Debugging Techniques
