Promise and Peril of Collaborative Code Generation Models: Balancing Effectiveness and Memorization
Zhi Chen, Lingxiao Jiang

TL;DR
This paper examines collaborative code generation models, weighing their effectiveness against privacy risks such as memorization and data leakage, and compares centralized, federated, and incremental training.
Contribution
It provides a comprehensive analysis of collaborative training settings for code models, highlighting factors affecting performance and privacy, and offers practical recommendations for secure, effective collaboration.
Findings
Federated learning achieves performance comparable to centralized training.
Lower memorization ratios in federated models suggest better data privacy.
Cross-organizational code clones pose challenges in collaborative training.
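To make the "memorization ratio" finding concrete, here is a minimal sketch of one way such a ratio could be computed. This summary does not give the paper's exact definition, so the one used here is an assumption: the fraction of training samples whose true continuation the model reproduces verbatim when prompted with the sample's prefix. The names `memorization_ratio`, `generate`, `prefix_len`, and `suffix_len` are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Sequence


def memorization_ratio(
    samples: Sequence[Sequence[str]],
    generate: Callable[[List[str], int], Sequence[str]],
    prefix_len: int,
    suffix_len: int,
) -> float:
    """Fraction of eligible samples whose continuation is reproduced verbatim.

    Assumed definition (not taken from the paper): a sample counts as
    memorized if, given its first `prefix_len` tokens, the model emits
    exactly the next `suffix_len` tokens of that sample.
    """
    eligible = 0
    memorized = 0
    for tokens in samples:
        # Skip samples too short to supply both a prefix and a suffix.
        if len(tokens) < prefix_len + suffix_len:
            continue
        eligible += 1
        prefix = list(tokens[:prefix_len])
        target = list(tokens[prefix_len:prefix_len + suffix_len])
        # `generate` stands in for any model's decoding call.
        output = list(generate(prefix, suffix_len))
        if output == target:
            memorized += 1
    return memorized / eligible if eligible else 0.0
```

Computing this ratio separately over each participant's training data would allow a per-organization comparison across centralized, federated, and incremental settings, which is the kind of comparison the findings describe.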
Abstract
In the rapidly evolving field of machine learning, training models with datasets from various locations and organizations presents significant challenges due to privacy and legal concerns. The exploration of effective collaborative training settings capable of leveraging valuable knowledge from distributed and isolated datasets is increasingly crucial. This study investigates key factors that impact the effectiveness of collaborative training methods in code next-token prediction, as well as the correctness and utility of the generated code, demonstrating the promise of such methods. Additionally, we evaluate the memorization of different participant training data across various collaborative training settings, including centralized, federated, and incremental training, highlighting their potential risks in leaking data. Our findings indicate that the size and diversity of code datasets…
Taxonomy
Topics: Model-Driven Software Engineering Techniques · Advanced Software Engineering Methodologies · Software Engineering Techniques and Practices
