DocCGen: Document-based Controlled Code Generation
Sameer Pimparkhede, Mehant Kammakomati, Srikanth Tamilselvam, Prince Kumar, Ashok Pon Kumar, Pushpak Bhattacharyya

TL;DR
DocCGen is a framework that improves structured code generation from natural language by leveraging library documentation to guide library selection and schema-based decoding, enhancing accuracy for domain-specific languages.
Contribution
It introduces a two-step process using documentation for library detection and schema constraints, addressing limitations of existing in-context learning and fine-tuning methods.
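The two-step idea can be illustrated with a toy sketch. This is not the authors' implementation: the `LibraryDoc` class, the `DOCS` entries, the keyword-overlap retriever, and the candidate scores are all hypothetical stand-ins chosen for illustration. Step 1 picks the library whose documentation best matches the query; step 2 restricts each generated key to the names allowed by that library's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LibraryDoc:
    """Hypothetical documentation record for one library or module."""
    name: str
    description: str
    schema_keys: frozenset  # argument names the documented schema allows


# Toy documentation pool (illustrative data, not taken from the paper).
DOCS = [
    LibraryDoc("copy", "copy files to remote locations",
               frozenset({"src", "dest", "mode"})),
    LibraryDoc("service", "manage services start stop restart",
               frozenset({"name", "state", "enabled"})),
]


def tokenize(text):
    """Lowercase whitespace tokenization; a stand-in for a real retriever."""
    return set(text.lower().split())


def detect_library(query, docs):
    """Step 1: choose the library whose description overlaps the query most."""
    query_tokens = tokenize(query)
    return max(docs, key=lambda d: len(query_tokens & tokenize(d.description)))


def constrained_pick(scores, allowed):
    """Step 2: choose the best-scoring candidate that the schema permits.

    `scores` maps candidate keys to model scores (assumed values here).
    Candidates outside `allowed` are masked out before the choice is made.
    """
    valid = {key: score for key, score in scores.items() if key in allowed}
    if not valid:
        raise ValueError("no candidate satisfies the schema")
    return max(valid, key=valid.get)


if __name__ == "__main__":
    doc = detect_library("restart the nginx service", DOCS)
    # An unconstrained model might prefer the invalid key "status".
    key = constrained_pick({"status": 0.9, "state": 0.4, "name": 0.3},
                           doc.schema_keys)
    print(doc.name, key)
```

In this sketch the highest-scoring candidate, `status`, is rejected because it does not appear in the selected library's schema, so the next-best valid key is chosen instead. A real system would apply the same masking at the token level during decoding rather than over whole keys.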
Findings
Consistently improves code accuracy across models and metrics
Reduces syntactic and semantic errors in structured code
Effective for complex languages like YAML and Bash
Abstract
Recent developments show that Large Language Models (LLMs) produce state-of-the-art performance on natural language (NL) to code generation for resource-rich general-purpose languages like C++, Java, and Python. However, their practical usage for structured domain-specific languages (DSLs) such as YAML and JSON is limited due to domain-specific schema, grammar, and customizations generally unseen by LLMs during pre-training. Efforts have been made to mitigate this challenge via in-context learning through relevant examples or by fine-tuning. However, these approaches suffer from problems such as limited DSL samples and prompt sensitivity, even though enterprises maintain good documentation of the DSLs. Therefore, we propose DocCGen, a framework that can leverage such rich knowledge by breaking the NL-to-Code generation task for structured code languages into a two-step process. First, it detects the correct…
Taxonomy
Topics: Model-Driven Software Engineering Techniques
Methods: Lib
