UniCoder: Scaling Code Large Language Model via Universal Code
Tao Sun, Linzheng Chai, Jian Yang, Yuwei Yin, Hongcheng Guo, Jiaheng Liu, Bing Wang, Liqun Yang, Zhoujun Li

TL;DR
UniCoder introduces a universal code intermediate representation to improve large language models' ability to generate code, significantly enhancing performance over previous prompting methods by leveraging structured pseudo-code.
Contribution
The paper proposes universal code as an intermediate representation for code generation, and trains UniCoder with multi-task learning on a new dataset, improving code quality and model performance.
Findings
UniCoder outperforms previous prompting methods significantly.
Universal code improves the structural understanding of code generation.
The approach enhances code quality through better intermediate representations.
Abstract
Intermediate reasoning or acting steps have successfully improved large language models (LLMs) on various downstream natural language processing (NLP) tasks. When applying LLMs to code generation, recent works mainly focus on directing the models to articulate intermediate natural-language reasoning steps, as in chain-of-thought (CoT) prompting, and then output code with the natural language or other structured intermediate steps. However, such output is not suitable for code translation or generation tasks, since standard CoT differs from code in both logical structure and form of expression. In this work, we introduce the universal code (UniCode) as the intermediate representation. It is a description of algorithm steps using a mix of programming-language conventions, such as assignment operators, conditional operators, and loops. Hence, we collect an instruction…
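As a rough illustration of the idea in the abstract, the sketch below pairs a pseudo-code description of an algorithm with its Python translation and joins the two into one output string, with the intermediate representation first. The pseudo-code syntax, the section markers, and the helper name `build_target` are assumptions made for this example; the excerpt above does not specify the actual UniCode format or training data layout.

```python
# Hypothetical illustration only: the pseudo-code syntax, the section markers,
# and build_target are assumptions, not the format used by the UniCoder authors.

# Algorithm steps written with generic conventions (assignment, condition, loop).
UNIVERSAL_CODE = """\
function sum_even(numbers):
    total <- 0
    for each x in numbers:
        if x mod 2 == 0:
            total <- total + x
    return total
"""


def sum_even(numbers):
    """Python translation of the pseudo-code above."""
    total = 0
    for x in numbers:
        if x % 2 == 0:
            total += x
    return total


def build_target(universal_code: str, final_code: str) -> str:
    """Place the intermediate representation before the final program,
    mirroring the 'intermediate steps, then code' ordering."""
    return f"[Universal code]\n{universal_code}\n[Code]\n{final_code}"


if __name__ == "__main__":
    print(sum_even([1, 2, 3, 4]))
    print(build_target(UNIVERSAL_CODE, "def sum_even(numbers): ..."))
```

The point of the sketch is only the ordering: the language-agnostic steps come first, and the concrete program follows them.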
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
