Analyzing Prominent LLMs: An Empirical Study of Performance and Complexity in Solving LeetCode Problems
Everton Guimaraes, Nathalia Nascimento, Chandan Shivalingaiah, Asish Nelapati

TL;DR
This empirical study benchmarks four prominent LLMs on 150 LeetCode problems, analyzing their performance and complexity in code generation tasks across different difficulty levels to guide developers in model selection.
Contribution
The paper provides a systematic comparison of ChatGPT, Copilot, Gemini, and DeepSeek on LeetCode problems, highlighting their performance differences and practical implications.
Findings
ChatGPT shows consistent efficiency in execution time and memory usage.
Copilot and DeepSeek exhibit variability with increasing task complexity.
Gemini performs well on simpler tasks but needs more attempts on harder problems.
Abstract
Large Language Models (LLMs) such as ChatGPT, Copilot, Gemini, and DeepSeek are transforming software engineering by automating key tasks, including code generation, testing, and debugging. As these models become integral to development workflows, a systematic comparison of their performance is essential for optimizing their use in real-world applications. This study benchmarks these four prominent LLMs on 150 LeetCode problems across easy, medium, and hard difficulties, generating solutions in Java and Python. We evaluate each model based on execution time, memory usage, and algorithmic complexity, revealing significant performance differences. ChatGPT demonstrates consistent efficiency in execution time and memory usage, while Copilot and DeepSeek show variability as task complexity increases. Gemini, although effective on simpler tasks, requires more attempts as…
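To make the evaluation criteria concrete, the following is a minimal local sketch of how the execution time and peak memory of a single generated Python solution might be measured. It is not the authors' harness: the summary does not say how the measurements were taken (they may well come from LeetCode's own submission reports), so the `profile` helper, the `two_sum` stand-in solution, and the choice of `time.perf_counter` and `tracemalloc` are all assumptions made for illustration.

```python
import time
import tracemalloc


def two_sum(nums, target):
    """Hypothetical stand-in for a model-generated solution (classic Two Sum)."""
    seen = {}  # maps a value already visited to its index
    for i, value in enumerate(nums):
        if target - value in seen:
            return [seen[target - value], i]
        seen[value] = i
    return []


def profile(func, *args, repeats=5):
    """Hypothetical helper: return (result, best seconds, peak traced bytes)."""
    best_time = float("inf")
    peak_bytes = 0
    result = None
    for _ in range(repeats):
        tracemalloc.start()  # track Python-level allocations for this run
        start = time.perf_counter()
        result = func(*args)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        best_time = min(best_time, elapsed)  # keep the fastest of the repeats
        peak_bytes = max(peak_bytes, peak)   # keep the largest observed peak
    return result, best_time, peak_bytes


result, seconds, peak = profile(two_sum, [2, 7, 11, 15], 9)
print(result, f"{seconds * 1e6:.1f} us", f"{peak} bytes")
```

Because `tracemalloc` adds overhead while it is active, the timings from this sketch are inflated relative to an untraced run; a more careful setup would measure time and memory in separate passes. The absolute numbers also depend on the machine and interpreter, so they are only comparable across solutions run in the same environment.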
