Comprehensive Evaluation of Large Language Models on Software Engineering Tasks: A Multi-Task Benchmark

Go Frendi Gunawan; Mukhlis Amien

arXiv:2602.07079·cs.SE·February 10, 2026

Open Access

TL;DR

This paper evaluates 11 large language models on five software engineering tasks. Models with identical perfect scores differ by 22x in completion time, and the authors identify recurring inefficiency patterns. Data and tools are released for reproducibility.

Contribution

The paper contributes a multi-task benchmark for LLMs in software engineering and highlights efficiency disparities and inefficiency patterns not previously documented.

Findings

01. Models with identical perfect scores vary 22x in completion time.

02. Tool usage frequency shows no correlation with success (r = 0.077, p = 0.575).

03. Coding tasks achieve 100% success; research tasks achieve 90.9%.

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in software engineering, yet comprehensive benchmarks covering diverse SE activities remain limited. We present a multi-task evaluation of 11 state-of-the-art LLMs across five representative software engineering tasks: bug fixing, feature development, code refactoring, technical copywriting, and research synthesis. Our automated verification framework measures both output quality and completion efficiency. Key findings reveal that (1) models achieving identical perfect scores exhibit 22x variation in completion time, 49x variation in tool efficiency, and 53x variation in estimated cost; (2) tool usage frequency shows no correlation with success (r = 0.077, p = 0.575): one model used 917 tool calls while another solved the same task with 3 calls; (3) we identify two distinct inefficiency patterns: loop inefficiency…
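The abstract reports two kinds of summary statistics: a Pearson correlation between tool usage and success, and "Nx" spreads between the most and least efficient models. The sketch below shows how such figures are typically computed. It is not the authors' framework; the function names `pearson_r` and `spread_ratio` and all sample numbers are hypothetical and chosen only for illustration, and the p-value reported in the abstract is not computed here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spread_ratio(values):
    """Ratio of the largest to the smallest value, i.e. an 'Nx variation'."""
    smallest = min(values)
    if smallest <= 0:
        raise ValueError("values must be positive")
    return max(values) / smallest


if __name__ == "__main__":
    # Hypothetical per-run data, NOT taken from the paper:
    # number of tool calls in a run, and whether the run succeeded (1) or not (0).
    tool_calls = [4, 12, 40, 7, 150, 25]
    succeeded = [1, 1, 0, 1, 1, 0]
    print("r =", round(pearson_r(tool_calls, succeeded), 3))

    # Hypothetical completion times in seconds for runs with identical scores.
    completion_seconds = [30.0, 95.0, 210.0, 660.0]
    print("spread =", spread_ratio(completion_seconds), "x")
```

A value of r near zero, as in the abstract, means that knowing how many tool calls a run made says almost nothing about whether it succeeded. A significance test such as the one behind the reported p-value would need an additional step, for example a t-test on r.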

Taxonomy

Topics: Software Engineering Research · Software Engineering Techniques and Practices · Software Testing and Debugging Techniques