Studying LLM Performance on Closed- and Open-source Data

Toufique Ahmed; Christian Bird; Premkumar Devanbu; Saikat Chakraborty

arXiv:2402.15100·cs.SE·February 26, 2024·5 cites

Open Access

TL;DR

This study compares the performance of Large Language Models on open-source versus proprietary code, revealing that language and identifier differences impact effectiveness, with potential improvements via in-context learning.

Contribution

It provides empirical insights into LLM performance on proprietary code, highlighting language-specific effects and suggesting in-context learning as a mitigation strategy.

Findings

01. Performance for C# remains consistent across OSS and proprietary code.

02. Performance for C++ significantly decreases in proprietary code.

03. Differences in identifiers contribute to performance gaps.

Abstract

Large language models (LLMs) are finding wide use in software engineering practice. These models are extremely data-hungry and are largely trained on open-source (OSS) code distributed with permissive licenses. In terms of actual use, however, a great deal of software development still occurs in the for-profit/proprietary sphere, where the code under development is not, and never has been, in the public domain; thus, many developers do their work, and use LLMs, in settings where the models may not be as familiar with the code under development. In such settings, do LLMs work as well as they do for OSS code? If not, what are the differences? When performance differs, what are the possible causes, and are there workarounds? In this paper, we examine this issue using proprietary, closed-source software data from Microsoft, where most proprietary code is in C# and C++. We find that…

Taxonomy

Topics: Data Mining Algorithms and Applications