DocCGen: Document-based Controlled Code Generation

Sameer Pimparkhede; Mehant Kammakomati; Srikanth Tamilselvam; Prince Kumar; Ashok Pon Kumar; Pushpak Bhattacharyya

arXiv:2406.11925 · cs.SE · July 4, 2024 · 1 citation

TL;DR

DocCGen is a framework that improves structured code generation from natural language by leveraging library documentation to guide library selection and schema-based decoding, enhancing accuracy for domain-specific languages.

Contribution

The paper introduces a two-step process that uses documentation first for library detection and then for schema constraints, addressing limitations of existing in-context learning and fine-tuning methods.
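The two-step idea can be illustrated with a toy sketch. This is not the authors' implementation: the documentation store `DOCS`, the library names `file_copy` and `pkg_install`, the token-overlap scoring, and the helpers `detect_library` and `constrain_keys` are all hypothetical, chosen only to show how documentation might drive library selection and then restrict output to a schema.

```python
import re

# Hypothetical documentation store (an assumption for illustration):
# library name -> short description and the keys its schema allows.
DOCS = {
    "file_copy": {
        "description": "copy files to remote locations",
        "schema": {"src", "dest", "mode", "owner"},
    },
    "pkg_install": {
        "description": "manage apt packages install remove update",
        "schema": {"name", "state", "update_cache"},
    },
}


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z_]+", text.lower()))


def detect_library(query, docs):
    """Step 1: pick the library whose description shares the most tokens with the query."""
    q = tokenize(query)
    return max(docs, key=lambda name: len(q & tokenize(docs[name]["description"])))


def constrain_keys(candidate_keys, schema):
    """Step 2: keep only the candidate keys that the selected library's schema permits."""
    return [k for k in candidate_keys if k in schema]


query = "install the nginx package with apt"
library = detect_library(query, DOCS)
# "dest" belongs to the other library's schema, so it is filtered out.
allowed = constrain_keys(["name", "state", "dest"], DOCS[library]["schema"])
print(library, allowed)
```

A real system would likely use a stronger retriever than token overlap and would apply the schema during decoding rather than as a filter afterwards; the sketch only shows how the two stages fit together.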

Findings

01 Consistently improves code accuracy across models and metrics

02 Reduces syntactic and semantic errors in structured code

03 Effective for complex languages like YAML and Bash

Abstract

Recent developments show that Large Language Models (LLMs) produce state-of-the-art performance on natural language (NL) to code generation for resource-rich general-purpose languages like C++, Java, and Python. However, their practical usage for structured domain-specific languages (DSLs) such as YAML and JSON is limited due to domain-specific schema, grammar, and customizations generally unseen by LLMs during pre-training. Efforts have been made to mitigate this challenge via in-context learning through relevant examples or by fine-tuning. However, these approaches suffer from problems such as limited DSL samples and prompt sensitivity, whereas enterprises maintain good documentation of their DSLs. Therefore, we propose DocCGen, a framework that can leverage such rich knowledge by breaking the NL-to-Code generation task for structured code languages into a two-step process. First, it detects the correct…

Taxonomy

Topics: Model-Driven Software Engineering Techniques

Methods: Lib