OOP: Object-Oriented Programming Evaluation Benchmark for Large Language Models

Shuai Wang; Liang Ding; Li Shen; Yong Luo; Bo Du; Dacheng Tao

arXiv:2401.06628 · cs.CL · February 22, 2024 · 2 cites

TL;DR

This paper introduces a new benchmark and evaluation metric focused on object-oriented programming in large language models, revealing significant gaps in current models' OOP capabilities.

Contribution

The study presents the first OOP-specific benchmark with a novel pass@o metric, and evaluates 23 LLMs, highlighting the need for improved OOP understanding in models.

Findings

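01. pass@o gives a more relevant and comprehensive assessment of OOP code generation than pass@k.

02. Code-specialized LLMs such as WizardCoder underperform on OOP tasks compared to general models such as ChatGPT, despite excelling at FP.

03. All advanced LLMs perform poorly on the OOP benchmark.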

Abstract

Advancing automated programming necessitates robust and comprehensive code generation benchmarks, yet current evaluation frameworks largely neglect object-oriented programming (OOP) in favor of functional programming (FP), e.g., HumanEval and MBPP. To address this, our study introduces a pioneering OOP-focused benchmark, featuring 431 Python programs that encompass essential OOP concepts and features like classes and encapsulation methods. We propose a novel evaluation metric, pass@o, tailored for OOP, enhancing traditional pass@k measures. Our evaluation of 23 leading large language models (LLMs), including both general and code-specialized models, reveals three key insights: 1) pass@o offers a more relevant and comprehensive assessment for OOP code generation; 2) Despite excelling in FP, code-specialized LLMs like WizardCoder lag in OOP compared to models like ChatGPT; 3) The poor…
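The abstract names pass@o as an enhancement of pass@k but this page does not define it. As a rough illustration only, the sketch below computes the standard unbiased pass@k estimate, 1 - C(n-c, k) / C(n, k), and then applies the same formula with a stricter count. It assumes, without confirmation from the text above, that a sample counts toward pass@o only if it passes its tests and also defines the class and method names the task expects. The helper names (`defined_names`, `pass_at_o`) and the sample dictionary layout are hypothetical and are not taken from the paper or its repository.

```python
import ast
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for n samples of which c are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


def defined_names(code: str) -> set:
    """Names of all classes and functions defined in the code (hypothetical helper)."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return set()
    return {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.ClassDef, ast.FunctionDef))
    }


def pass_at_o(samples: list, required: set, k: int) -> float:
    """Assumed pass@o: a sample counts only if it passed its tests AND
    defines every required class/method name. This is an assumption,
    not the paper's stated definition."""
    o = sum(
        1
        for s in samples
        if s["passed"] and required <= defined_names(s["code"])
    )
    return pass_at_k(len(samples), o, k)


# Two samples that both pass the tests; only the first uses the expected class.
samples = [
    {"passed": True, "code": "class Stack:\n    def push(self, x):\n        pass\n"},
    {"passed": True, "code": "def push(stack, x):\n    pass\n"},
]
plain = pass_at_k(len(samples), 2, 1)
strict = pass_at_o(samples, {"Stack", "push"}, 1)
print(plain, strict)
```

Under this assumed definition the second sample passes functionally but is not counted, so the stricter score is lower than plain pass@k for the same samples.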

Code & Models

Repositories

alphadl/oop-eval (official repository)

Taxonomy

Topics: Software Engineering Research · Parallel Computing and Optimization Techniques · Software System Performance and Reliability