TL;DR
This paper introduces CodeMMR, a unified multimodal retrieval model that embeds natural language, code, and images into a shared space, improving code search and generation across visual and textual modalities.
Contribution
It presents the first comprehensive benchmark for multimodal code IR and proposes a model that outperforms baselines in cross-modal retrieval and enhances code generation fidelity.
Findings
CodeMMR outperforms baselines by 10 points on nDCG@10.
It generalizes well across multiple modalities and programming languages.
Integrating CodeMMR into RAG improves code generation and visual grounding.
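For readers unfamiliar with the metric cited in the findings, the following is a minimal sketch of how nDCG@10 is commonly computed for a single query. It is not taken from the paper: the function names `dcg_at_k` and `ndcg_at_k` are hypothetical, linear gain is assumed, and the ideal ranking is derived only from the relevance labels passed in, so the list should cover every labeled candidate for the query. Benchmarks usually average this value over all queries; the summary does not say which scale its "10 points" refers to.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k positions (linear gain assumed)."""
    # Position 0 is rank 1, so the discount is log2(rank + 1) = log2(index + 2).
    return sum(rel / math.log2(idx + 2) for idx, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """nDCG@k: DCG of the given ranking divided by DCG of the ideal ordering.

    `ranked_relevances` holds the relevance label of each candidate in the
    order the retriever returned them (hypothetical input format).
    """
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant candidates for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# One relevant item retrieved at rank 3: DCG = 1/log2(4) = 0.5, ideal DCG = 1.
print(ndcg_at_k([0, 0, 1, 0]))  # 0.5
```

A relevant item ranked outside the top 10 contributes nothing, which is why the metric rewards retrievers that place the correct code or image near the top of the list.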
Abstract
Code search, framed as information retrieval (IR), underpins modern software engineering and increasingly powers retrieval-augmented generation (RAG), improving code discovery, reuse, and the reliability of LLM-based coding. Yet existing code IR models remain largely text-centric and often overlook the visual and structural aspects inherent in programming artifacts such as web interfaces, data visualizations, SVGs, schematic diagrams, and UML. To bridge this gap, we introduce MMCoIR, the first comprehensive benchmark for evaluating multimodal code IR across five visual domains, eight programming languages, and eleven libraries, and we demonstrate the difficulty of the task through extensive evaluation. We then propose CodeMMR, a unified retrieval model that jointly embeds natural language, code, and images into a shared semantic space through instruction-based multimodal alignment. CodeMMR…
