HackerRank-ASTRA: Evaluating Correctness & Consistency of Large Language Models on cross-domain multi-file project problems

Jun Xing; Mayur Bhatia; Sahil Phulwani; Darshan Suresh; Rafik Matta

arXiv:2502.00226·cs.LG·February 4, 2025

Open Access

TL;DR

The HackerRank-ASTRA benchmark assesses large language models on multi-file, project-based coding problems, emphasizing real-world applicability and model consistency across multiple runs.

Contribution

It introduces a new project-based evaluation benchmark with consistency metrics and taxonomy-level analysis for LLMs in software development tasks.

Findings

01

The top three models (o1, o1-preview, and Claude-3.5-Sonnet-1022) each averaged about 75%.

02

Claude-3.5-Sonnet-1022 showed the highest consistency.

03

No statistically significant performance difference among the top three models.

Abstract

Evaluating the real-world applicability of large language models (LLMs) provides valuable insights for their development and use in software development tasks. Existing benchmarks often focus on standalone coding problems or specific libraries, overlooking multi-file, project-based scenarios and lacking a rigorous evaluation of consistency. The HackerRank-ASTRA Benchmark introduces project-based coding problems that mirror real-world scenarios. It evaluates model consistency through 32 runs (k = 32) and median standard deviation while incorporating taxonomy-level analysis to assess sub-skill capabilities. Initial evaluations on 65 problems show that the top three models -- o1, o1-preview, and Claude-3.5-Sonnet-1022 -- achieved comparable average scores of 75%, with no statistically significant differences in performance. Notably, Claude-3.5-Sonnet-1022 demonstrated the highest…
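The two summary numbers the abstract mentions, an average score and a median standard deviation over repeated runs, can be illustrated with a minimal sketch. This is not the authors' evaluation code. The function name `consistency_summary`, the input layout (one list of per-run scores per problem, each score in [0, 1]), and the use of the population standard deviation are all assumptions made for illustration. The toy data uses 4 runs per problem rather than the benchmark's k = 32.

```python
from statistics import mean, median, pstdev


def consistency_summary(scores_by_problem):
    """Summarize repeated-run scores (hypothetical helper, not from the paper).

    scores_by_problem maps a problem id to the list of scores a model
    obtained on that problem across k independent runs.
    """
    # Mean score per problem, then averaged over problems.
    per_problem_mean = [mean(runs) for runs in scores_by_problem.values()]
    # Spread of scores per problem; population std dev is an assumption here.
    per_problem_std = [pstdev(runs) for runs in scores_by_problem.values()]
    return {
        "average_score": mean(per_problem_mean),
        # A lower median standard deviation means more consistent behavior.
        "median_std": median(per_problem_std),
    }


# Toy data: three made-up problems, four runs each.
toy_scores = {
    "problem_a": [1.0, 1.0, 1.0, 1.0],    # always solved, zero spread
    "problem_b": [0.0, 1.0, 0.0, 1.0],    # solved half the time
    "problem_c": [0.5, 0.75, 0.5, 0.75],  # partial credit, small spread
}

summary = consistency_summary(toy_scores)
print(summary)
```

Taking the median rather than the mean of the per-problem standard deviations keeps a few highly unstable problems from dominating the consistency figure.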

Taxonomy

Topics: Business Process Modeling and Analysis · Topic Modeling · Software Engineering Research
