OODEval: Evaluating Large Language Models on Object-Oriented Design

Bingxu Xiao; Yunwei Dong; Yiqi Tang; Manqing Zhang; Yifan Zhou; Chunyan Ma; Yepang Liu

arXiv:2601.07602 · cs.SE · March 12, 2026

TL;DR

This paper introduces OODEval, a comprehensive benchmark and evaluation framework for assessing large language models' capabilities in object-oriented design tasks, highlighting their strengths and weaknesses compared to human designers.

Contribution

The paper presents the OODEval and OODEval-Human benchmarks, along with the CLUE metric set, to systematically evaluate LLMs on object-oriented design, a capability that prior software engineering evaluations have largely left unexamined.

Findings

01 LLMs achieve high syntactic accuracy but lack semantic understanding.

02 Qwen3-Coder-30B performs best among the evaluated models.

03 Models fall short of top human designers in design quality, and the study identifies common failure modes.

Abstract

Recent advances in large language models (LLMs) have driven extensive evaluations in software engineering. However, most prior work concentrates on code-level tasks, leaving software design capabilities underexplored. To fill this gap, we conduct a comprehensive empirical study evaluating 29 LLMs on object-oriented design (OOD) tasks. Owing to the lack of standardized benchmarks and metrics, we introduce OODEval, a manually constructed benchmark comprising 50 OOD tasks of varying difficulty, and OODEval-Human, the first human-rated OOD benchmark, which includes 940 undergraduate-submitted class diagrams evaluated by instructors. We further propose CLUE (Class Likeness Unified Evaluation), a unified metric set that assesses both global correctness and fine-grained design quality in class diagram generation. Using these benchmarks and metrics, we investigate five research questions:…

Taxonomy

Topics: Software Engineering Research · Topic Modeling · Machine Learning in Materials Science