Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming
Anisha Agarwal, Aaron Chan, Shubham Chandel, Jinu Jang, Shaun Miller, Roshanak Zilouchian Moghaddam, Yevhen Mohylevskyy, Neel Sundaresan, Michele Tufano

TL;DR
This paper introduces the Copilot evaluation harness, a comprehensive framework for assessing the performance of LLM-guided IDE interactions across various programming tasks and languages, aiming to improve developer productivity.
Contribution
We present a new evaluation system with robust metrics for measuring LLM performance in IDE scenarios, covering multiple developer tasks and providing insights for future LLM development.
Findings
Evaluated three common LLMs using the new metrics.
Identified strengths and weaknesses of LLMs in different programming tasks.
Provided data to guide future LLM optimization in IDEs.
Abstract
The integration of Large Language Models (LLMs) into Integrated Development Environments (IDEs) has become a focal point in modern software development. LLMs such as OpenAI GPT-3.5/4 and Code Llama offer the potential to significantly augment developer productivity by serving as intelligent, chat-driven programming assistants. However, utilizing LLMs out of the box is unlikely to be optimal for any given scenario. Rather, each system requires the LLM to be honed to its set of heuristics to ensure the best performance. In this paper, we introduce the Copilot evaluation harness: a set of data and tools for evaluating LLM-guided IDE interactions, covering various programming scenarios and languages. We propose our metrics as a more robust and information-dense evaluation than previous state-of-the-art evaluation systems. We design and compute both static and execution-based success metrics for…
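The distinction between static and execution-based success metrics can be illustrated with a minimal sketch. This is not the harness's implementation: the function names, the toy candidates, and the choice of "parses as valid Python" as the static check are all assumptions made for illustration. The static check only inspects the generated code, while the execution check actually runs it against a test.

```python
import ast


def static_success(code: str) -> bool:
    """Hypothetical static metric: the generated code parses as valid Python."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False


def execution_success(code: str, test: str) -> bool:
    """Hypothetical execution metric: the code runs and the test snippet passes."""
    namespace = {}
    try:
        exec(code, namespace)   # define the generated function(s)
        exec(test, namespace)   # run the assertion against them
        return True
    except Exception:
        return False


def success_rate(results) -> float:
    """Fraction of candidates that passed a given check."""
    results = list(results)
    return sum(results) / len(results) if results else 0.0


# Toy model outputs (assumed for illustration, not taken from the paper).
candidates = [
    "def add(a, b):\n    return a + b\n",  # parses and passes the test
    "def add(a, b):\n    return a - b\n",  # parses but fails the test
    "def add(a, b)\n    return a + b\n",   # does not parse
]
test = "assert add(2, 3) == 5"

static_rate = success_rate(static_success(c) for c in candidates)
execution_rate = success_rate(execution_success(c, test) for c in candidates)
```

In this toy setup two of the three candidates pass the static check but only one passes the execution check, which is why the two kinds of metric are worth reporting separately. Running untrusted model output with `exec` is unsafe outside a sandbox; a real harness would isolate execution.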
Taxonomy
Topics: Software Engineering Research · Software Reliability and Analysis Research
Methods: Attention Is All You Need · Sparse Evolutionary Training · Linear Layer · Byte Pair Encoding · Attention Dropout · Dropout · Dense Connections
