NoFunEval: Funny How Code LMs Falter on Requirements Beyond Functional Correctness

Manav Singhal; Tushar Aggarwal; Abhijeet Awasthi; Nagarajan Natarajan; Aditya Kanade

arXiv:2401.15963·cs.SE·October 1, 2024·1 cites

TL;DR

NoFunEval introduces a new benchmark to evaluate code language models on non-functional requirements like security and efficiency, revealing their limitations beyond functional correctness and questioning their understanding of real-world software needs.

Contribution

The paper presents NoFunEval, a novel benchmark for assessing code LMs on non-functional requirements and introduces the Coding Concepts prompting method for better domain knowledge communication.

Findings

01. Code LMs perform poorly on non-functional requirement tasks.

02. Even on functional correctness tasks, classification accuracy is surprisingly low.

03. The results highlight fundamental blindspots in current training setups of code LMs.

Abstract

Existing evaluation benchmarks of language models of code (code LMs) focus almost exclusively on whether the LMs can generate functionally-correct code. In real-world software engineering, developers think beyond functional correctness. They have requirements on "how" a functionality should be implemented to meet overall system design objectives like efficiency, security, and maintainability. They would also trust the code LMs more if the LMs demonstrate robust understanding of such requirements. We propose a new benchmark NoFunEval to evaluate code LMs on non-functional requirements and simple classification instances for both functional and non-functional requirements. We propose a prompting method, Coding Concepts (CoCo), as a way for a developer to communicate the domain knowledge to the LMs. We conduct an extensive evaluation of 27 code LMs. Our finding is that LMs generally…
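
The idea of passing developer domain knowledge to a model alongside a non-functional requirement can be illustrated with a small prompt builder. This is only a sketch under stated assumptions: the function name `build_concept_prompt`, the prompt wording, and the example concepts are hypothetical and are not the paper's actual CoCo template or benchmark data.

```python
from typing import Sequence


def build_concept_prompt(source_code: str, requirement: str, concepts: Sequence[str]) -> str:
    """Assemble an edit prompt from code, a requirement, and concept hints.

    Hypothetical layout for illustration; not the template used in the paper.
    """
    lines = [
        "Edit the code below to satisfy the stated requirement.",
        f"Requirement: {requirement}",
    ]
    # Concept hints are optional; omit the section when none are supplied.
    if concepts:
        lines.append("Relevant coding concepts:")
        lines.extend(f"- {concept}" for concept in concepts)
    lines.append("Code:")
    lines.append(source_code.rstrip())
    return "\n".join(lines)


# Made-up example: an efficiency requirement with two developer-supplied hints.
example_prompt = build_concept_prompt(
    source_code="def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n",
    requirement="Reduce runtime on large inputs.",
    concepts=["built-in sum", "avoid Python-level loops"],
)
print(example_prompt)
```

Keeping the concept hints as a separate argument makes it easy to compare a model's output with and without them on the same requirement.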

Code & Models

Datasets

ManavSinghal157/NoFunEval
dataset · 177 downloads

Taxonomy

Topics: Software Engineering Research · Software Reliability and Analysis Research · Software Engineering Techniques and Practices

Methods: Focus