CodeMixBench: Evaluating Large Language Models on Code Generation with Code-Mixed Prompts

Manik Sheokand; Parth Sawant

arXiv:2505.05063 · cs.LG · May 9, 2025

TL;DR

CodeMixBench is a new benchmark that evaluates large language models on their ability to generate code from multilingual, code-mixed prompts, revealing performance drops and challenges in multilingual code generation.

Contribution

The paper introduces CodeMixBench, a benchmark for assessing LLMs on code generation from code-mixed prompts across three language pairs, filling a gap left by English-only evaluation frameworks.

Findings

01. Performance drops with code-mixed prompts, especially for smaller models.

02. Higher code-mixing levels lead to greater performance degradation.

03. Benchmark highlights challenges in multilingual code generation.

Abstract

Large Language Models (LLMs) have achieved remarkable success in code generation tasks, powering various applications like code completion, debugging, and programming assistance. However, existing benchmarks such as HumanEval, MBPP, and BigCodeBench primarily evaluate LLMs on English-only prompts, overlooking the real-world scenario where multilingual developers often use code-mixed language while interacting with LLMs. To address this gap, we introduce CodeMixBench, a novel benchmark designed to evaluate the robustness of LLMs on code generation from code-mixed prompts. Built upon BigCodeBench, CodeMixBench introduces controlled code-mixing (CMD) into the natural language parts of prompts across three language pairs: Hinglish (Hindi-English), Spanish-English, and Chinese Pinyin-English. We comprehensively evaluate a diverse set of open-source code generation models ranging from 1.5B to…
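To make the idea of controlled code-mixing concrete, here is a minimal sketch, not the authors' pipeline. It assumes the mixing degree is the fraction of lexicon-covered English words that get swapped for a counterpart in the other language, which may differ from the paper's own definition. The function name `code_mix`, the `TOY_LEXICON` entries, and the `seed` parameter are all hypothetical and exist only for illustration. The function is meant to be applied to the natural-language part of a prompt only, never to the code itself.

```python
import random
import re

# Hypothetical toy lexicon (English -> romanized Hindi), for illustration only.
# It is not taken from the paper or the dataset.
TOY_LEXICON = {
    "and": "aur",
    "all": "sab",
    "every": "har",
    "name": "naam",
    "new": "naya",
}


def code_mix(prompt: str, lexicon: dict, degree: float, seed: int = 0) -> str:
    """Swap a fraction `degree` of the lexicon-covered words in `prompt`.

    Assumption: `degree` is in [0, 1] and is measured over the words that
    the lexicon can translate, not over all words in the prompt.
    """
    if not 0.0 <= degree <= 1.0:
        raise ValueError("degree must be between 0 and 1")

    # Split into alternating word / non-word runs so that joining the
    # tokens back together reproduces spacing and punctuation exactly.
    tokens = re.findall(r"\w+|\W+", prompt)

    # Positions of the tokens that the lexicon is able to replace.
    eligible = [i for i, tok in enumerate(tokens) if tok.lower() in lexicon]

    # Pick how many of them to replace, then choose them reproducibly.
    k = round(degree * len(eligible))
    chosen = set(random.Random(seed).sample(eligible, k))

    return "".join(
        lexicon[tok.lower()] if i in chosen else tok
        for i, tok in enumerate(tokens)
    )


prompt = "Return all names and every new name."
print(code_mix(prompt, TOY_LEXICON, 0.0))  # unchanged English prompt
print(code_mix(prompt, TOY_LEXICON, 1.0))  # every covered word replaced
```

Raising `degree` from 0 to 1 gives progressively more heavily mixed versions of the same task description, which is the kind of controlled variation a robustness benchmark needs in order to compare models at matched mixing levels.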

Code & Models

Datasets

ColdSlim/CodeMixBench
dataset · 43 downloads

Taxonomy

Topics: Software Engineering Research · Topic Modeling · Machine Learning in Materials Science

Methods: Sparse Evolutionary Training