TL;DR
MATCH is a reference-free metric, trained with contrastive learning, that scores how well generated code implements a natural-language task description. Its scores correlate with functional correctness and human preference across multiple programming languages, without requiring reference code or unit tests.
Contribution
The paper presents MATCH, a contrastive learning approach to reference-free code evaluation that correlates more strongly with functional correctness and human preference than prior metrics.
Findings
MATCH correlates more strongly with functional correctness than existing metrics.
MATCH aligns more closely with human preference than existing metrics.
The method is effective across multiple programming languages.
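To make "correlation with functional correctness" concrete, here is a minimal sketch of how such a correlation is commonly measured: rank agreement between metric scores and pass/fail outcomes. The choice of Kendall's tau-a, the function name `kendall_tau_a`, and the toy numbers are assumptions for illustration; this summary does not state which statistic or data the paper uses.

```python
def kendall_tau_a(scores, labels):
    """Kendall's tau-a: (concordant - discordant) / total pairs.

    Pairs tied on either variable count as neither concordant nor
    discordant. Assumed statistic, chosen only for illustration.
    """
    n = len(scores)
    if n != len(labels) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (scores[i] - scores[j]) * (labels[i] - labels[j])
            if product > 0:
                concordant += 1
            elif product < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical data: metric scores for four generated snippets and
# whether each snippet passed its unit tests (1) or not (0).
metric_scores = [0.9, 0.8, 0.3, 0.1]
passed_tests = [1, 1, 0, 0]
tau = kendall_tau_a(metric_scores, passed_tests)
```

A metric that ranks passing snippets above failing ones yields a positive tau; a metric that ranks them in the opposite order yields a negative one.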
Abstract
AI-based code generation is increasingly prevalent, with GitHub Copilot estimated to generate 46% of the code on GitHub. Accurately evaluating how well generated code aligns with developer intent remains a critical challenge. Traditional evaluation methods, such as unit tests, are often unscalable and costly. Syntactic similarity metrics (e.g., BLEU, ROUGE) fail to capture code functionality, and metrics like CodeBERTScore require reference code, which is not always available. To address the gap in reference-free evaluation, with few alternatives such as ICE-Score, this paper introduces MATCH, a novel reference-free metric. MATCH uses Contrastive Learning to generate meaningful embeddings for code and natural language task descriptions, enabling similarity scoring that reflects how well generated code implements the task. We show that MATCH achieves stronger correlations with functional…
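The scoring idea in the abstract (embed the task description and the generated code, then compare the embeddings) can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the embeddings are hand-written toy vectors rather than outputs of a trained encoder, cosine similarity is an assumed choice of similarity function, and the margin-based loss is one generic contrastive objective whose form the abstract does not specify.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def contrastive_loss(task_emb, code_emb, is_match, margin=0.5):
    """Generic cosine-based contrastive loss (assumed form).

    Matching (task, code) pairs are pulled toward similarity 1;
    mismatched pairs are penalized only when their similarity
    exceeds the margin.
    """
    sim = cosine_similarity(task_emb, code_emb)
    if is_match:
        return 1.0 - sim
    return max(0.0, sim - margin)


# Toy embeddings (hypothetical values standing in for encoder outputs).
task = [1.0, 0.0, 1.0]
good_code = [0.9, 0.1, 1.1]   # code that implements the task
bad_code = [-1.0, 1.0, 0.0]   # unrelated code

good_score = cosine_similarity(task, good_code)
bad_score = cosine_similarity(task, bad_code)
```

At evaluation time only the similarity is needed, which is what makes a metric of this kind reference-free: it compares generated code against the task description rather than against a reference solution.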