Skill over Scale: The Case for Medium, Domain-Specific Models for SE
Manisha Mukherjee, Vincent J. Hellendoorn

TL;DR
This paper demonstrates that modestly sized, domain-specific language models trained on StackOverflow data with established pre-training best practices can outperform much larger generalist models on code labeling tasks, offering a cost-effective alternative.
Contribution
The authors show that well-trained medium-sized domain-specific models can outperform larger models, emphasizing the importance of training practices and data alignment for code tasks.
Findings
SOBert models outperform larger generalist models on code labeling tasks.
Models trained properly on in-domain data can rival or surpass larger models.
Affordable models like SOBert are publicly available for research and development.
Abstract
Recent advancements in AI have sparked a trend in constructing large, generalist language models that handle a multitude of tasks, including many code-related ones. While these models are expensive to train and are often closed-source, they have enjoyed broad adoption because they tend to outperform smaller, domain-specific models of code. In this work, we argue that this is not a foregone conclusion. We show that modestly sized domain-specific models can outperform much larger ones on code labeling tasks, provided they are trained to the same standards. Concretely, we focus on StackOverflow (SO), which offers large volumes of aligned code and text data. We align established best practices for pre-training large language models with properties of SO as a data source, especially using a large context window (2,048 tokens), coupled with a powerful toolkit (Megatron-LM) to train two…
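As a rough illustration of why the context window size mentioned in the abstract matters, the sketch below computes what share of documents fit entirely inside a given window. The function name `fraction_fitting` and the token counts are hypothetical and chosen only for illustration; they are not statistics from the paper or from StackOverflow.

```python
def fraction_fitting(token_counts, window):
    """Return the share of documents whose token count fits in the window."""
    if not token_counts:
        return 0.0
    fitting = sum(1 for n in token_counts if n <= window)
    return fitting / len(token_counts)


# Hypothetical token counts for five posts (illustrative only).
example_counts = [120, 480, 900, 1600, 3000]

# Compare a shorter window with the 2,048-token window named in the abstract.
share_512 = fraction_fitting(example_counts, 512)
share_2048 = fraction_fitting(example_counts, 2048)
print(f"fits in 512 tokens:  {share_512:.0%}")
print(f"fits in 2048 tokens: {share_2048:.0%}")
```

With real data, the token counts would come from the model's own tokenizer rather than a hand-written list; posts that do not fit have to be truncated or split.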
Taxonomy
Topics: Topic Modeling · Software Engineering Research · Natural Language Processing Techniques
Methods: Attention Is All You Need · Cosine Annealing · Linear Layer · Linear Warmup With Linear Decay · Attention Dropout · Layer Normalization · Byte Pair Encoding · RoBERTa · Softmax
