Evaluating How Fine-tuning on Bimodal Data Effects Code Generation
Gabriel Orlanski, Seonhye Yang, Michael Healy

TL;DR
Fine-tuning language models on bimodal data from coding forums improves code generation performance and reduces errors, but higher temperatures can decrease program runnability, highlighting the need for better data integration methods.
Contribution
This paper introduces a bimodal dataset from StackOverflow for fine-tuning models, demonstrating significant performance gains and error reduction in code generation tasks.
Findings
Average 54.64% pass@k improvement on HumanEval
Average 85.35% pass@k improvement on MBPP (Mostly Basic Programming Problems)
Higher temperatures decrease program runnability
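For readers unfamiliar with the pass@k metric reported above, the following is a minimal sketch of how it is commonly estimated. It assumes the unbiased estimator introduced with HumanEval (Chen et al., 2021); this summary does not state which estimator or which values of k the authors used. The function name `pass_at_k` and its parameters are illustrative.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem as 1 - C(n - c, k) / C(n, k).

    n: number of programs sampled for the problem
    c: number of those programs that pass all unit tests
    k: sample budget being scored (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over a benchmark given (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

A benchmark score is the mean of the per-problem estimates, so `mean_pass_at_k([(10, 3), (10, 0)], k=1)` averages one problem solved by 3 of 10 samples with one problem that was never solved.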
Abstract
Despite the increase in popularity of language models for code generation, it is still unknown how training on bimodal coding forums affects a model's code generation performance and reliability. We, therefore, collect a dataset of over 2.2M StackOverflow questions with answers for finetuning. These fine-tuned models have average improvements of 54.64% and 85.35% on the HumanEval (Chen et al., 2021) and Mostly Basic Program Problems (Austin et al., 2021) tasks, respectively. This regime further decreases the number of generated programs with both syntax and runtime errors. However, we find that at higher temperatures, there are significant decreases to the model's ability to generate runnable programs despite higher scores, underscoring the need for better methods of incorporating such data that mitigate these side effects. The code can be found…
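The abstract distinguishes programs that fail with syntax errors, programs that fail at runtime, and programs that run. The sketch below shows one simple way such a three-way split could be made for Python output. It is an assumption for illustration, not the authors' evaluation harness, and the name `classify_program` and its labels are hypothetical.

```python
import ast


def classify_program(source: str) -> str:
    """Label a generated program as 'syntax_error', 'runtime_error', or 'runnable'.

    Illustrative only: exec() on untrusted model output is unsafe. A real
    harness would run each program in a sandboxed subprocess with a timeout.
    """
    try:
        tree = ast.parse(source)
    except (SyntaxError, ValueError):
        # The program does not parse, so it can never be executed.
        return "syntax_error"
    try:
        # Execute in a fresh namespace so programs cannot affect each other.
        exec(compile(tree, "<generated>", "exec"), {"__name__": "__generated__"})
    except Exception:
        return "runtime_error"
    return "runnable"
```

Counting the labels over all samples drawn at each temperature gives the kind of runnability comparison the abstract describes; note that "runnable" here only means the program executed without raising, not that it passed any tests.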
Taxonomy
Topics: Software Engineering Research · Topic Modeling · Software Engineering Techniques and Practices
