CodeMixBench: Evaluating Large Language Models on Code Generation with Code-Mixed Prompts
Manik Sheokand, Parth Sawant

TL;DR
CodeMixBench is a new benchmark that evaluates large language models on their ability to generate code from multilingual, code-mixed prompts, revealing performance drops and challenges in multilingual code generation.
Contribution
The paper introduces CodeMixBench, a benchmark for assessing LLMs on code generation from code-mixed prompts across multiple language pairs, filling a gap left by English-only evaluation frameworks.
Findings
Performance drops with code-mixed prompts, especially for smaller models.
Higher code-mixing levels lead to greater performance degradation.
The benchmark highlights open challenges in multilingual code generation.
Abstract
Large Language Models (LLMs) have achieved remarkable success in code generation tasks, powering various applications like code completion, debugging, and programming assistance. However, existing benchmarks such as HumanEval, MBPP, and BigCodeBench primarily evaluate LLMs on English-only prompts, overlooking the real-world scenario where multilingual developers often use code-mixed language while interacting with LLMs. To address this gap, we introduce CodeMixBench, a novel benchmark designed to evaluate the robustness of LLMs on code generation from code-mixed prompts. Built upon BigCodeBench, CodeMixBench introduces controlled code-mixing (CMD) into the natural language parts of prompts across three language pairs: Hinglish (Hindi-English), Spanish-English, and Chinese Pinyin-English. We comprehensively evaluate a diverse set of open-source code generation models ranging from 1.5B to…
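To make the idea of controlled code-mixing concrete, here is a minimal sketch of how a fixed fraction of the natural-language words in a prompt could be swapped for romanized Hindi. It is not the authors' pipeline: the function `code_mix`, the `mix_fraction` parameter, and the tiny `TOY_LEXICON` are all hypothetical names and data introduced for illustration only.

```python
import random

# Hypothetical toy English -> romanized Hindi lexicon (illustration only,
# not taken from the paper or from CodeMixBench).
TOY_LEXICON = {
    "return": "lautao",
    "all": "sab",
    "number": "sankhya",
    "list": "suchi",
}


def code_mix(prompt: str, mix_fraction: float, seed: int = 0) -> str:
    """Replace roughly `mix_fraction` of the translatable words in `prompt`.

    `mix_fraction` is an assumed control knob in [0, 1]; 0 leaves the prompt
    in English, 1 replaces every word that the toy lexicon covers.
    """
    if not 0.0 <= mix_fraction <= 1.0:
        raise ValueError("mix_fraction must be in [0, 1]")

    words = prompt.split()
    # Positions of words the toy lexicon knows how to translate.
    candidates = [i for i, w in enumerate(words) if w.lower() in TOY_LEXICON]
    k = round(mix_fraction * len(candidates))

    # A seeded generator keeps the mixing reproducible across runs.
    rng = random.Random(seed)
    chosen = set(rng.sample(candidates, k))

    mixed = [
        TOY_LEXICON[w.lower()] if i in chosen else w
        for i, w in enumerate(words)
    ]
    return " ".join(mixed)


if __name__ == "__main__":
    english = "return the sum of all number values in the list"
    for fraction in (0.0, 0.5, 1.0):
        print(fraction, "->", code_mix(english, fraction))
```

Seeding the random generator means the same prompt and fraction always yield the same mixed prompt, which is what makes the mixing level a controlled variable when comparing models.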
Taxonomy
Topics: Software Engineering Research · Topic Modeling · Machine Learning in Materials Science
Methods: Sparse Evolutionary Training
