CodeCSE: A Simple Multilingual Model for Code and Comment Sentence Embeddings

Anthony Varkey; Siyuan Jiang; Weijing Huang

arXiv:2407.06360 · cs.SE · July 10, 2024

TL;DR

CodeCSE introduces a contrastive learning model that generates multilingual function and comment embeddings in a shared space, achieving zero-shot performance comparable to fine-tuned models in code search tasks.
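To make the idea of a shared embedding space concrete, here is a minimal sketch of a symmetric contrastive (InfoNCE-style) loss over paired function and comment vectors. It is an illustration under stated assumptions, not the authors' training code: the exact objective, the temperature value, and the function names `info_nce` and `symmetric_contrastive_loss` are all hypothetical, and the toy vectors are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, candidates, temperature=0.05):
    """Mean cross-entropy of matching anchor i to candidate i in the batch.

    Assumption: pairs are aligned by index, and every other candidate in
    the batch serves as a negative. The temperature is a placeholder value.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, c) / temperature for c in candidates]
        peak = max(logits)  # subtract the max for numerical stability
        log_denom = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denom - logits[i]
    return total / len(anchors)


def symmetric_contrastive_loss(code_vecs, comment_vecs, temperature=0.05):
    """Average of the code-to-comment and comment-to-code losses."""
    return 0.5 * (
        info_nce(code_vecs, comment_vecs, temperature)
        + info_nce(comment_vecs, code_vecs, temperature)
    )


# Made-up 2-d embeddings: each function sits close to its own comment.
code_vecs = [[1.0, 0.0], [0.0, 1.0]]
aligned_comments = [[0.9, 0.1], [0.1, 0.9]]
swapped_comments = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = symmetric_contrastive_loss(code_vecs, aligned_comments)
loss_swapped = symmetric_contrastive_loss(code_vecs, swapped_comments)
```

The loss is small when each function is closest to its own comment and large when the pairs are swapped, which is the pressure that pulls matching code and text together in one space.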

Contribution

This paper presents CodeCSE, the first out-of-the-box multilingual model for function embeddings that performs competitively without language-specific fine-tuning.

Findings

01. On code search, CodeCSE's zero-shot performance is comparable to that of models fine-tuned from GraphCodeBERT for specific languages.

02. It learns multilingual function and comment embeddings in a shared space.

03. The open-source implementation and the pretrained model are publicly available.

Abstract

Pretrained language models for code token embeddings are used in code search, code clone detection, and other code-related tasks. Code function embeddings are similarly useful in such tasks. However, there are no out-of-the-box models for function embeddings in the current literature. This paper therefore proposes CodeCSE, a contrastive learning model that learns embeddings for functions and their descriptions in one space. We evaluated CodeCSE using code search. CodeCSE's multilingual zero-shot approach is as efficient as the models fine-tuned from GraphCodeBERT for specific languages. CodeCSE is open source at https://github.com/emu-se/codecse and the pretrained model is available at the HuggingFace public hub: https://huggingface.co/sjiang1/codecse
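The code search evaluation mentioned in the abstract can be pictured as nearest-neighbor retrieval in the shared space: embed the query, embed the candidate functions, and rank by similarity. The sketch below is a hedged illustration only. The vectors are made-up stand-ins rather than CodeCSE outputs, the helper names `rank_functions` and `mean_reciprocal_rank` are hypothetical, and the use of mean reciprocal rank (a common code search metric) is an assumption not stated in the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_functions(query_vec, function_vecs):
    """Return function indices ordered from most to least similar."""
    scores = [cosine(query_vec, f) for f in function_vecs]
    return sorted(range(len(function_vecs)), key=lambda i: scores[i], reverse=True)


def mean_reciprocal_rank(query_vecs, function_vecs, gold):
    """Average of 1 / rank of the correct function for each query."""
    total = 0.0
    for query, target in zip(query_vecs, gold):
        ranking = rank_functions(query, function_vecs)
        total += 1.0 / (ranking.index(target) + 1)
    return total / len(query_vecs)


# Made-up embeddings standing in for model outputs (not real CodeCSE vectors).
function_vecs = [
    [0.9, 0.1, 0.0],   # function 0
    [0.0, 1.0, 0.1],   # function 1
    [0.1, 0.0, 0.95],  # function 2
]
query_vecs = [
    [1.0, 0.0, 0.1],   # comment describing function 0
    [0.6, 0.7, 0.0],   # comment describing function 1
]
gold = [0, 1]  # index of the correct function for each query

ranking = rank_functions(query_vecs[0], function_vecs)
mrr = mean_reciprocal_rank(query_vecs, function_vecs, gold)
```

Because no language-specific fine-tuning step appears anywhere in this loop, the same ranking code would apply to functions in any language the embedding model covers, which is what the zero-shot setting amounts to in practice.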

Code & Models

Repositories

emu-se/codecse
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Software Engineering Research · Hate Speech and Cyberbullying Detection