CodeFusion: A Pre-trained Diffusion Model for Code Generation
Mukul Singh, José Cambronero, Sumit Gulwani, Vu Le, Carina Negreanu, Gust Verbruggen

TL;DR
CodeFusion is a pre-trained diffusion model for code generation that can revise earlier tokens while generating. With 75M parameters, it performs competitively with much larger auto-regressive models (350M-175B parameters) on Bash, Python, and Excel conditional formatting tasks.
Contribution
It introduces a diffusion-based approach for code generation conditioned on natural language, addressing limitations of auto-regressive models in reconsidering earlier tokens.
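To make the contrast with left-to-right decoding concrete, here is a minimal toy sketch of an iterative denoising loop. It is not CodeFusion's architecture: the vocabulary, the one-hot "embeddings", the `TARGETS` lookup, and the `toy_denoiser` function are all hypothetical stand-ins for learned components, chosen only so the loop runs without any training. What it illustrates is the control flow: the whole program is represented at once, and every position is revised at every step.

```python
import random

# Hypothetical toy vocabulary; one-hot vectors stand in for learned embeddings.
VOCAB = ["ls", "grep", "-l", "|", "txt"]
EMBED = {tok: [1.0 if i == j else 0.0 for j in range(len(VOCAB))]
         for i, tok in enumerate(VOCAB)}

# Hypothetical lookup that replaces a trained, text-conditioned network.
TARGETS = {"list text files": ["ls", "|", "grep", "txt"]}

def toy_denoiser(latent, condition):
    """Stand-in for a learned denoiser: returns a guess of the clean latent.

    A real model would predict this from the noisy latent and the encoded
    natural language; this toy version just looks the answer up.
    """
    return [EMBED[tok] for tok in TARGETS[condition]]

def decode(latent):
    """Map each latent row to the closest token (argmax for one-hot vectors)."""
    return [VOCAB[max(range(len(VOCAB)), key=row.__getitem__)] for row in latent]

def generate(condition, steps=10, seed=0):
    rng = random.Random(seed)
    length = len(TARGETS[condition])
    # Start from pure Gaussian noise over ALL positions of the program.
    latent = [[rng.gauss(0.0, 1.0) for _ in VOCAB] for _ in range(length)]
    history = []
    for t in range(steps, 0, -1):
        guess = toy_denoiser(latent, condition)
        alpha = 1.0 / t  # step size reaches 1.0 on the final step
        # Every position moves toward the current guess, including early ones.
        latent = [[x + alpha * (g - x) for x, g in zip(row, guess_row)]
                  for row, guess_row in zip(latent, guess)]
        history.append(decode(latent))
    return decode(latent), history

program, history = generate("list text files")
print(" ".join(program))
```

The `history` list records the decoded program after each step, so it can be inspected to see tokens at early positions change between steps, which an auto-regressive decoder does not do once a token is emitted.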
Findings
Performs on par with larger auto-regressive models in top-1 accuracy
Outperforms them in top-3 and top-5 accuracy due to a better balance between diversity and quality
Effective across Bash, Python, and Excel conditional formatting tasks
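The top-k figures above can be read as follows: a prompt counts as solved if any of the first k generated candidates is correct. The sketch below assumes exact string match as the correctness check, which is an assumption made for illustration; the paper's benchmarks may judge correctness differently. The function name and sample data are hypothetical.

```python
def top_k_accuracy(candidates_per_prompt, references, k):
    """Fraction of prompts with at least one correct candidate in the first k.

    Correctness here is exact string match after stripping whitespace,
    which is an illustrative assumption, not the paper's stated metric.
    """
    hits = 0
    for candidates, reference in zip(candidates_per_prompt, references):
        if any(c.strip() == reference.strip() for c in candidates[:k]):
            hits += 1
    return hits / len(references)

# Hypothetical ranked candidates for two prompts.
candidates = [
    ["ls -l", "ls -la", "ls"],
    ["grep foo", "grep -r foo .", "find ."],
]
references = ["ls -l", "grep -r foo ."]

print(top_k_accuracy(candidates, references, k=1))  # 0.5
print(top_k_accuracy(candidates, references, k=3))  # 1.0
```

In this example the second prompt is only solved by its second-ranked candidate, so accuracy rises from top-1 to top-3. This is how a model whose candidates are more varied can gain at larger k even when its top-1 accuracy is similar.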
Abstract
Imagine a developer who can only change their last line of code, how often would they have to start writing a function from scratch before it is correct? Auto-regressive models for code generation from natural language have a similar limitation: they do not easily allow reconsidering earlier tokens generated. We introduce CodeFusion, a pre-trained diffusion code generation model that addresses this limitation by iteratively denoising a complete program conditioned on the encoded natural language. We evaluate CodeFusion on the task of natural language to code generation for Bash, Python, and Microsoft Excel conditional formatting (CF) rules. Experiments show that CodeFusion (75M parameters) performs on par with state-of-the-art auto-regressive systems (350M-175B parameters) in top-1 accuracy and outperforms them in top-3 and top-5 accuracy due to its better balance in diversity versus…
Taxonomy
Topics: Software Engineering Research · Parallel Computing and Optimization Techniques · Machine Learning and Data Classification
Methods: Diffusion
