CodeCSE: A Simple Multilingual Model for Code and Comment Sentence Embeddings
Anthony Varkey, Siyuan Jiang, Weijing Huang

TL;DR
CodeCSE introduces a contrastive learning model that generates multilingual function and comment embeddings in a shared space, achieving zero-shot performance comparable to fine-tuned models in code search tasks.
Contribution
This paper presents CodeCSE, the first out-of-the-box multilingual model for function embeddings that performs competitively without language-specific fine-tuning.
Findings
In code search, CodeCSE's zero-shot performance is comparable to that of GraphCodeBERT models fine-tuned for specific languages.
It effectively learns multilingual function and comment embeddings in a shared space.
Open-source implementation and pretrained models are publicly available.
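Because functions and comments share one embedding space, zero-shot code search reduces to a nearest-neighbor lookup: embed the natural-language query, then rank candidate functions by similarity. The sketch below illustrates that ranking step only. The three-dimensional vectors, the function names, and the `rank_functions` helper are hypothetical placeholders, not CodeCSE outputs or part of its API, and cosine similarity is assumed as the similarity measure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_functions(query_vec, function_vecs):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in function_vecs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical embeddings for functions written in different languages.
# In practice these would come from an embedding model; the values here
# are made up purely to show the ranking logic.
function_vecs = {
    "python:read_file": [0.9, 0.1, 0.0],
    "java:sortList": [0.1, 0.9, 0.1],
    "go:httpGet": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of the query "sort a list of numbers".
query_vec = [0.1, 0.8, 0.2]

ranking = rank_functions(query_vec, function_vecs)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

No per-language index is needed in this setup: candidates from every language are scored against the same query vector.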
Abstract
Pretrained language models for code token embeddings are used in code search, code clone detection, and other code-related tasks. Function-level embeddings are useful in the same tasks. However, the current literature offers no out-of-the-box models for function embeddings. This paper therefore proposes CodeCSE, a contrastive learning model that learns embeddings for functions and their descriptions in one space. We evaluated CodeCSE using code search. CodeCSE's multilingual zero-shot approach is as effective as the models fine-tuned from GraphCodeBERT for specific languages. CodeCSE is open source at https://github.com/emu-se/codecse and the pretrained model is available at the HuggingFace public hub: https://huggingface.co/sjiang1/codecse
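The abstract does not spell out the training objective, so the following is only a sketch of one common contrastive formulation (InfoNCE with in-batch negatives), not necessarily the loss CodeCSE uses. Each function is pulled toward its own description and pushed away from the other descriptions in the batch. The `info_nce` name, the toy vectors, and the temperature of 0.05 are assumptions made for illustration.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(code_vecs, comment_vecs, temperature=0.05):
    """Mean InfoNCE loss where comment_vecs[i] is the positive for code_vecs[i].

    All other comments in the batch act as negatives. The temperature
    value is an assumed default, not a setting taken from the paper.
    """
    code = [normalize(v) for v in code_vecs]
    comments = [normalize(v) for v in comment_vecs]
    total = 0.0
    for i, c in enumerate(code):
        logits = [sum(a * b for a, b in zip(c, d)) / temperature for d in comments]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(code)

# Toy batch: two functions and their two descriptions.
code_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_comments = [[1.0, 0.0], [0.0, 1.0]]     # each comment matches its function
mismatched_comments = [[0.0, 1.0], [1.0, 0.0]]  # pairs swapped

loss_aligned = info_nce(code_batch, aligned_comments)
loss_mismatched = info_nce(code_batch, mismatched_comments)
print(loss_aligned, loss_mismatched)
```

The loss is near zero when each function sits closest to its own description and grows when pairs are mismatched, which is the pressure that places code and comments in one space.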
Taxonomy
Topics: Natural Language Processing Techniques · Software Engineering Research · Hate Speech and Cyberbullying Detection
