Comprehensive Evaluation of Large Language Models on Software Engineering Tasks: A Multi-Task Benchmark
Go Frendi Gunawan, Mukhlis Amien

TL;DR
This paper evaluates 11 large language models across five software engineering tasks, revealing significant efficiency variations and identifying key inefficiency patterns, with comprehensive data and tools released for reproducibility.
Contribution
It provides a comprehensive multi-task benchmark for LLMs in software engineering, highlighting efficiency disparities and inefficiency patterns not previously documented.
Findings
Models with identical perfect scores vary 22x in completion time
Tool usage frequency shows no correlation with success (r = 0.077, p = 0.575)
Coding tasks achieve 100% success, research tasks 90.9%
Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities in software engineering, yet comprehensive benchmarks covering diverse SE activities remain limited. We present a multi-task evaluation of 11 state-of-the-art LLMs across five representative software engineering tasks: bug fixing, feature development, code refactoring, technical copywriting, and research synthesis. Our automated verification framework measures both output quality and completion efficiency. Key findings reveal that (1) models achieving identical perfect scores exhibit 22x variation in completion time, 49x variation in tool efficiency, and 53x variation in estimated cost; (2) tool usage frequency shows no correlation with success (r = 0.077, p = 0.575): one model used 917 tool calls while another solved the same task with 3 calls; (3) we identify two distinct inefficiency patterns: loop inefficiency…
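The two kinds of statistic quoted in the abstract, a max-to-min variation ratio and a correlation coefficient between tool usage and success, can be sketched as follows. This is a minimal illustration, not the authors' analysis code. The abstract does not say which correlation coefficient was used, so Pearson's r is assumed here (with a binary success variable it coincides with the point-biserial correlation). The function names and the toy inputs are hypothetical.

```python
from math import sqrt


def variation_ratio(values):
    """Ratio of the largest to the smallest value, e.g. slowest vs. fastest run.

    Assumption: "Nx variation" in the abstract means max/min over models.
    """
    smallest = min(values)
    if smallest <= 0:
        raise ValueError("values must be strictly positive")
    return max(values) / smallest


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical toy data, NOT taken from the paper's released results.
    tool_calls = [3, 10, 25, 40, 60]
    succeeded = [1, 1, 0, 1, 1]  # 1 = task verified as solved, 0 = failed
    print("variation ratio:", variation_ratio(tool_calls))
    print("pearson r:", pearson_r(tool_calls, succeeded))
```

A value of r near zero, as reported in the abstract, would indicate that issuing more tool calls is not associated with a higher success rate.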
Taxonomy
Topics: Software Engineering Research · Software Engineering Techniques and Practices · Software Testing and Debugging Techniques
