TL;DR
This paper presents MM-Coder, a multilingual multimodal model that integrates visual design diagrams with textual instructions to improve code generation, supported by a new dataset and benchmark addressing multimodal challenges.
Contribution
Introduces MM-Coder, a multimodal code generation model that combines visual and textual inputs, along with the MMc-Instruct dataset and the MMEval benchmark for evaluation.
Findings
MM-Coder improves code accuracy with visual inputs
MMEval reveals challenges in visual information capture
Multimodal instructions enhance architectural alignment
Abstract
The rapid advancement of Large Language Models (LLMs) has significantly improved code generation, yet most models remain text-only, neglecting crucial visual aids like diagrams and flowcharts used in real-world software development. To bridge this gap, we introduce MM-Coder, a Multilingual Multimodal software developer. MM-Coder integrates visual design inputs, namely Unified Modeling Language (UML) diagrams and flowcharts (termed Visual Workflow), with textual instructions to enhance code generation accuracy and architectural alignment. To enable this, we developed MMc-Instruct, a diverse multimodal instruction-tuning dataset including visual-workflow-based code generation, allowing MM-Coder to synthesize textual and graphical information like human developers, distinct from prior work on narrow tasks. Furthermore, we introduce MMEval, a new benchmark for evaluating multimodal code generation,…
