Toward Trustworthy Difficulty Assessments: Large Language Models as Judges in Programming and Synthetic Tasks
H.M. Shadman Tabib, Jaber Ahmed Deedar

TL;DR
This paper compares GPT-4o and LightGBM at assessing the difficulty of programming problems. GPT-4o reaches only 37.75% accuracy against LightGBM's 86%, which underscores the importance of explicit features for trustworthy difficulty evaluation.
Contribution
It provides a systematic comparison between LLM-based and feature-based difficulty assessment methods, highlights the shortcomings of GPT-4o, and offers insights for improving the trustworthiness of LLMs as judges.
Findings
LightGBM achieves 86% accuracy in difficulty classification.
GPT-4o reaches only 37.75% accuracy and overlooks numeric cues.
GPT-4o often mislabels synthetic Hard problems as Medium.
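The accuracy and mislabeling findings above are the kind of result a confusion matrix makes visible. The following is a minimal sketch in plain Python, assuming three string labels; the toy labels are invented for illustration and are not the paper's data, and the helper names are hypothetical.

```python
from collections import Counter

LABELS = ["Easy", "Medium", "Hard"]


def confusion_matrix(y_true, y_pred, labels=LABELS):
    """Rows are true labels, columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy labels, invented for illustration only (not taken from the paper).
y_true = ["Easy", "Medium", "Hard", "Hard", "Medium", "Hard"]
y_pred = ["Easy", "Medium", "Medium", "Medium", "Medium", "Hard"]

matrix = confusion_matrix(y_true, y_pred)
acc = accuracy(y_true, y_pred)

for label, row in zip(LABELS, matrix):
    print(f"{label:>6}: {row}")
print(f"accuracy = {acc:.2f}")
```

In this toy run, the Hard row places two of its three problems in the Medium column, which is the shape of error the third finding describes.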
Abstract
Large Language Models (LLMs) have demonstrated impressive capabilities in natural language and code generation, and are increasingly deployed as automatic judges of model outputs and learning activities. Yet, their behavior on structured tasks such as predicting the difficulty of competitive programming problems remains under-explored. We conduct a systematic comparison of GPT-4o, used purely as a natural-language difficulty assessor, against an interpretable LightGBM ensemble trained on explicit numeric and textual features. On a dataset of 1,825 LeetCode problems labeled Easy, Medium, or Hard, LightGBM attains 86% accuracy, whereas GPT-4o reaches only 37.75%. Detailed analyses, including confusion matrices and SHAP-based interpretability, show that numeric constraints -- such as input size limits and acceptance rates -- play a crucial role in separating Hard problems from easier…
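To make "explicit numeric features" concrete, here is a hedged sketch of how an input size limit and an acceptance rate might be turned into model inputs. The regular expression, the function names, and the feature names are all assumptions made for illustration; this summary does not specify the paper's actual feature set or extraction method.

```python
import math
import re

# Matches an upper bound such as "<= 100" or "<= 10^5" (assumed constraint format).
_BOUND = re.compile(r"<=\s*(\d+)(?:\s*\^\s*(\d+))?")


def max_upper_bound(text):
    """Return the largest numeric upper bound found in the text, or 0 if none."""
    bounds = []
    for base, exp in _BOUND.findall(text):
        value = int(base) ** int(exp) if exp else int(base)
        bounds.append(value)
    return max(bounds, default=0)


def numeric_features(text, acceptance_rate):
    """Build a small feature dict; the feature names are hypothetical."""
    bound = max_upper_bound(text)
    return {
        "log10_max_bound": math.log10(bound) if bound > 0 else 0.0,
        "acceptance_rate": acceptance_rate,
    }


# Hypothetical constraint text in a LeetCode-like style.
constraints = "1 <= nums.length <= 10^5, 0 <= k <= 100"
features = numeric_features(constraints, acceptance_rate=0.42)
print(features)
```

A dictionary like this could be one row of the table passed to a gradient-boosted classifier; taking the logarithm keeps bounds such as 10^5 and 10^9 on a comparable scale.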
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Text Readability and Simplification · Topic Modeling
