Instruction Tuning for Secure Code Generation

Jingxuan He; Mark Vero; Gabriela Krasnopolska; Martin Vechev

arXiv:2402.09497 · cs.CR · July 15, 2024 · 2 citations

TL;DR

This paper introduces SafeCoder, a method for instruction tuning language models to generate more secure code without sacrificing utility, addressing a critical security gap in existing models.

Contribution

SafeCoder is a security-centric fine-tuning approach that integrates security fine-tuning with standard instruction tuning, jointly optimizing the security and the utility of the code that language models generate.

Findings

01 Security improved by about 30%
02 Effective across various LMs and datasets
03 Maintains utility while enhancing security

Abstract

Modern language models (LMs) have gained widespread acceptance in everyday and professional contexts, particularly in programming. An essential procedure enabling this adoption is instruction tuning, which substantially enhances LMs' practical utility by training them to follow user instructions and human preferences. However, existing instruction tuning schemes overlook a crucial aspect: the security of generated code. As a result, even the state-of-the-art instruction-tuned LMs frequently produce unsafe code, posing significant security risks. In this work, we introduce SafeCoder to address this gap. SafeCoder performs security-centric fine-tuning using a diverse and high-quality dataset that we collected using an automated pipeline. We integrate the security fine-tuning with standard instruction tuning, to facilitate a joint optimization of both security and utility. Despite its…

Code & Models

Repositories

eth-sri/safecoder (official)

Datasets

LeTue09/train_safetycode_instruct_v1
dataset · 22 downloads

Taxonomy

Topics: Advanced Malware Detection Techniques · Teaching and Learning Programming · Security and Verification in Computing