UCRBench: Benchmarking LLMs on Use Case Recovery
Shuyuan Xiao, Yiran Zhang, Weisong Sun, Xiaohong Chen, Yang Liu, Zhi Jin

TL;DR
This paper introduces UCRBench, a new benchmark for evaluating large language models on use case recovery from source code, highlighting their capabilities and limitations in real-world software systems.
Contribution
The paper presents a manually validated, code-aligned use case benchmark across nine projects and a hierarchical evaluation protocol for LLMs in use case generation.
Findings
LLMs can partially reconstruct system functionality.
Performance varies significantly across projects.
LLMs exhibit high omission rates and have difficulty maintaining abstraction.
Abstract
Use cases are widely employed to specify functional requirements, yet existing benchmarks are scarce and risk being misaligned with actual system behavior, which limits the rigorous evaluation of large language models (LLMs) in generating use cases from source code. We address this gap by introducing a code-aligned use case benchmark, constructed through manual validation of both user-goal and subfunction use cases across nine real-world software projects. Using this benchmark, we conduct the first systematic study of LLMs and propose a hierarchical evaluation protocol that assesses actor correctness, name accuracy, path fidelity, and behavioral coverage. The results show that while LLMs can partially reconstruct system functionality, their performance varies significantly across projects, with particularly noticeable shortcomings in domain-specific and multi-module…
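To make the idea of a hierarchical evaluation concrete, here is a minimal Python sketch of how checks on actor, name, path, and coverage could be layered so that each level is scored only when the previous one passes. It is not the paper's protocol: the `UseCase` fields, the exact-match comparison after normalization, and the step-overlap measure are all assumptions chosen for illustration, and the abstract does not say how the actual protocol matches or scores use cases.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class UseCase:
    # Hypothetical minimal representation of a use case.
    actor: str
    name: str
    steps: List[str] = field(default_factory=list)


def _norm(text: str) -> str:
    # Case-insensitive, whitespace-insensitive comparison (an assumption).
    return " ".join(text.lower().split())


def step_overlap(reference: List[str], generated: List[str]) -> float:
    # Fraction of reference steps that also appear among the generated steps.
    if not reference:
        return 1.0
    gen = {_norm(s) for s in generated}
    return sum(_norm(s) in gen for s in reference) / len(reference)


def evaluate(reference: UseCase, generated: UseCase) -> Dict[str, object]:
    # Gate each level on the previous one: actor, then name, then path.
    result: Dict[str, object] = {"actor": False, "name": False, "path": None}
    result["actor"] = _norm(reference.actor) == _norm(generated.actor)
    if not result["actor"]:
        return result
    result["name"] = _norm(reference.name) == _norm(generated.name)
    if not result["name"]:
        return result
    result["path"] = step_overlap(reference.steps, generated.steps)
    return result


def behavioral_coverage(references: List[UseCase], generated: List[UseCase]) -> float:
    # Share of reference use cases matched on actor and name by any generated one.
    if not references:
        return 1.0
    covered = sum(
        any(evaluate(r, g)["name"] for g in generated) for r in references
    )
    return covered / len(references)


ref = UseCase("User", "Reset password",
              ["Enter email", "Receive link", "Set new password"])
gen = UseCase("user", "reset password",
              ["enter email", "set new password"])
report = evaluate(ref, gen)
coverage = behavioral_coverage([ref, UseCase("Admin", "Ban user")], [gen])
```

In this toy setup, an omission rate could be read as one minus `coverage`. A realistic evaluation would likely need semantic rather than exact matching for names and steps.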
Taxonomy
Topics: Software Engineering Research · Software System Performance and Reliability · Software Engineering Techniques and Practices
