RealSec-bench: A Benchmark for Evaluating Secure Code Generation in Real-World Repositories

Yanlin Wang; Ziyao Zhang; Chong Wang; Xinyi Xu; Mingwei Liu; Yong Wang; Jiachi Chen; Zibin Zheng

arXiv:2601.22706·cs.CR·February 2, 2026

TL;DR

This paper introduces RealSec-bench, a benchmark based on real-world Java repositories to evaluate LLMs' ability to generate secure code, revealing current models' limitations in balancing security and functionality.

Contribution

We present a novel benchmark constructed from real-world repositories, combining systematic vulnerability detection and expert validation, to evaluate secure code generation in LLMs.

Findings

01  RAG improves functional correctness but not security.
02  Prompting with security guidelines often causes compilation failures.
03  Current LLMs struggle to generate secure code while maintaining functionality.

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, but their proficiency in producing secure code remains a critical, under-explored area. Existing benchmarks often fall short by relying on synthetic vulnerabilities or evaluating functional correctness in isolation, failing to capture the complex interplay between functionality and security found in real-world software. To address this gap, we introduce RealSec-bench, a new benchmark for secure code generation meticulously constructed from real-world, high-risk Java repositories. Our methodology employs a multi-stage pipeline that combines systematic SAST scanning with CodeQL, LLM-based false positive elimination, and rigorous human expert validation. The resulting benchmark contains 105 instances grounded in real-world repository contexts, spanning 19 Common Weakness Enumeration (CWE) types and…
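The multi-stage pipeline described in the abstract (SAST scan, LLM-based false positive elimination, expert validation) can be pictured as a chain of filters over candidate findings. The sketch below is illustrative only and is not the authors' implementation: the `Finding` schema, the function names, and the stand-in rules for the LLM and expert stages are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Finding:
    """One candidate vulnerability from a static analyzer (hypothetical schema)."""
    repo: str
    file: str
    cwe: str       # CWE identifier, e.g. "CWE-089"
    message: str


# A stage returns True to keep a finding and False to drop it.
Stage = Callable[[Finding], bool]


def run_pipeline(findings: Iterable[Finding], stages: List[Stage]) -> List[Finding]:
    """Keep only the findings that pass every stage, applied in order."""
    kept = list(findings)
    for stage in stages:
        kept = [f for f in kept if stage(f)]
    return kept


def llm_says_true_positive(finding: Finding) -> bool:
    # Assumption: stands in for an LLM judgment. Here it is a trivial rule
    # that drops findings located in test code.
    return "/test/" not in finding.file


def expert_confirms(finding: Finding) -> bool:
    # Assumption: stands in for human review. Here it approves everything.
    return True


def count_cwe_types(findings: Iterable[Finding]) -> int:
    """Number of distinct CWE types among the given findings."""
    return len({f.cwe for f in findings})


if __name__ == "__main__":
    # Hypothetical scanner output; in practice this would be parsed from the
    # SAST tool's report rather than written by hand.
    scanned = [
        Finding("demo-repo", "src/main/java/A.java", "CWE-089", "SQL built from user input"),
        Finding("demo-repo", "src/test/java/ATest.java", "CWE-079", "unescaped output"),
        Finding("demo-repo", "src/main/java/B.java", "CWE-022", "path from user input"),
    ]
    validated = run_pipeline(scanned, [llm_says_true_positive, expert_confirms])
    print(len(validated), "instances,", count_cwe_types(validated), "CWE types")
```

Ordering the cheap automated filter before the expensive human stage reflects the order given in the abstract; each stage only ever narrows the candidate set, so the final instance and CWE counts describe findings that survived all checks.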

Taxonomy

Topics: Advanced Malware Detection Techniques · Software Engineering Research · Scientific Computing and Data Management