Multimodal LLM-assisted Evolutionary Search for Programmatic Control Policies

Qinglong Hu; Xialiang Tong; Mingxuan Yuan; Fei Liu; Zhichao Lu; and Qingfu Zhang

arXiv:2508.05433·cs.LG·March 11, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces MLES, a novel method combining multimodal large language models with evolutionary search to discover transparent, human-aligned control policies that perform comparably to traditional reinforcement learning methods.

Contribution

It presents a new approach for programmatic control policy discovery using multimodal LLMs and evolutionary search, enabling transparent, adaptable policies without relying on predefined languages.

Findings

01

MLES achieves performance comparable to PPO in control tasks.

02

MLES produces transparent, human-readable control policies.

03

The method is scalable and facilitates knowledge transfer across tasks.
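To make "transparent, human-readable control policy" concrete, the sketch below shows the kind of short rule-based program such a search could return for Lunar Lander. It is an illustration only, not a policy from the paper: the gains and thresholds are hand-picked and untuned, and the observation layout and action ids are assumed to follow Gymnasium's discrete LunarLander environment.

```python
from typing import Sequence

# Action ids assumed from Gymnasium's discrete LunarLander:
# 0 = do nothing, 1 = left orientation engine, 2 = main engine, 3 = right orientation engine.
NOOP, FIRE_LEFT, FIRE_MAIN, FIRE_RIGHT = 0, 1, 2, 3


def policy(obs: Sequence[float]) -> int:
    """Illustrative rule-based lander policy (hypothetical gains, not from the paper)."""
    x, y, vx, vy, angle, angular_vel, left_leg, right_leg = obs

    # Both legs on the ground: nothing left to do.
    if left_leg and right_leg:
        return NOOP

    # Desired tilt leans against horizontal offset and drift, clipped to +/- 0.4 rad.
    target_angle = max(-0.4, min(0.4, 0.5 * x + 1.0 * vx))
    # Desired height grows with horizontal distance from the pad.
    target_height = 0.55 * abs(x)

    # Proportional-derivative style corrections for attitude and altitude.
    turn = 0.5 * (target_angle - angle) - 1.0 * angular_vel
    thrust = 0.5 * (target_height - y) - 0.5 * vy

    # Prioritise the main engine when the altitude error dominates.
    if thrust > abs(turn) and thrust > 0.05:
        return FIRE_MAIN
    if turn < -0.05:
        return FIRE_RIGHT
    if turn > 0.05:
        return FIRE_LEFT
    return NOOP
```

Every decision in such a program can be read, unit-tested, and edited by hand, which is the contrast with an opaque neural-network policy that the findings draw.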

Abstract

Deep reinforcement learning has achieved impressive success in control tasks. However, its policies, represented as opaque neural networks, are often difficult for humans to understand, verify, and debug, which undermines trust and hinders real-world deployment. This work addresses this challenge by introducing a novel approach for programmatic control policy discovery, called Multimodal Large Language Model-assisted Evolutionary Search (MLES). MLES utilizes multimodal large language models as programmatic policy generators, combining them with evolutionary search to automate policy generation. It integrates visual feedback-driven behavior analysis within the policy generation process to identify failure patterns and guide targeted improvements, thereby enhancing policy discovery efficiency and producing adaptable, human-aligned policies. Experimental results demonstrate that MLES…
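The loop described in the abstract (generate policy programs, evaluate them, analyze their behavior, and feed that analysis back into generation) can be sketched roughly as follows. This is a toy under loud assumptions, not the authors' implementation: `stub_generator` is a hypothetical stand-in for the multimodal LLM, a numeric position trace stands in for the visual feedback, and the 1-D task, `TEMPLATE`, `Candidate`, and `INITIAL_GAINS` are all invented for illustration.

```python
import random
from dataclasses import dataclass
from typing import List

# Policies are kept as program text, mirroring the "programmatic policy" idea.
TEMPLATE = "def policy(pos):\n    return {gain!r} * (0.0 - pos)\n"
INITIAL_GAINS = (0.05, 0.1, 1.8, 1.9)  # hypothetical starting population


@dataclass
class Candidate:
    gain: float
    source: str
    score: float
    trace: List[float]


def evaluate(gain: float, steps: int = 20) -> Candidate:
    """Compile the policy source and roll it out on a toy 1-D task (reach 0 from 1)."""
    source = TEMPLATE.format(gain=gain)
    namespace: dict = {}
    exec(source, namespace)  # acceptable here only because we wrote the source ourselves
    act = namespace["policy"]
    pos, trace = 1.0, []
    for _ in range(steps):
        pos += act(pos)
        trace.append(pos)
    score = -sum(abs(p) for p in trace)  # higher is better
    return Candidate(gain, source, score, trace)


def stub_generator(parent: Candidate, rng: random.Random) -> float:
    """Stand-in for the MLLM: inspect the behavior trace and propose a targeted change."""
    overshoots = any(a * b < 0 for a, b in zip(parent.trace, parent.trace[1:]))
    # Failure pattern "oscillation" -> lower the gain; otherwise try a higher gain.
    factor = rng.uniform(0.5, 0.9) if overshoots else rng.uniform(1.1, 1.5)
    return parent.gain * factor


def evolve(generations: int = 15, pop_size: int = 4, seed: int = 0) -> Candidate:
    rng = random.Random(seed)
    population = [evaluate(g) for g in INITIAL_GAINS]
    for _ in range(generations):
        # Tournament selection of a parent, then one feedback-guided offspring.
        parent = max(rng.sample(population, 2), key=lambda c: c.score)
        child = evaluate(stub_generator(parent, rng))
        # Elitist survival: keep the best pop_size candidates.
        population = sorted(population + [child], key=lambda c: c.score, reverse=True)[:pop_size]
    return population[0]
```

Calling `evolve()` returns the best `Candidate`, whose `source` field is the readable program text. In a real system the stub would be replaced by a model call that receives the parent program together with rendered images of its rollouts.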

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 2 · Confidence 3

Strengths

- The integration of visual behaviour-feedback into the policy-synthesis pipeline is a genuinely novel contribution, enabling richer signals beyond scalar metrics in MLLM-policy generation.
- The paper is well-written and clearly laid out, with the framework, methodology, and experiments described in a comprehensible manner.
- The experiments include a transparent evolution process, showing how policies evolve across generations and how visual evidence influences the search. I personally liked t

Weaknesses

- The novelty is limited: the proposed method seems to be a direct modification of the existing work EoH that adds visual cues. In my opinion, simply adding visual cues does not count as a major contribution; rather, it relies heavily on the intrinsic ability of the MLLM.
- The experiments are limited and the potential for generalization is questionable. The authors test only on two tasks (Lunar Lander and Car Racing), which are relatively simple and for which simple code-based policies might exi

Reviewer 02 · Rating 4 · Confidence 3

Strengths

- The motivation for including visual signals to improve the generated programs is intuitive and reasonable, as images often contain richer information than text alone.
- Employing **Python** rather than a **Domain-Specific Language (DSL)** improves the method’s generalization capability, making it more adaptable to various domains and tasks.

Weaknesses

- The experiments are conducted in only two environments, which makes the results less convincing.
- The exclusion of the policy distillation baseline appears insufficiently justified. Although the proposed method is API-based and described as cost-efficient, invoking LLMs around 2000 times is still non-trivial. Distilling a program from a trained policy seems like an approach that could offer a reasonable trade-off between environmental interactions and LLM queries, even if the performance may

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. Figure 5 is nice, and it would be nice to have more like this in the appendix. Papers discussing programmatic policies usually present only their final interpretable policies, but the discovery procedure is missing. I am happy to see such a procedure with the help of evolutionary search.
2. An ablation study is provided, which is nice. This is especially helpful for a work that invokes an MLLM with different prompts (E1, E2, M1_M, M2_M).
3. The final policies, as well as the searching process are

Weaknesses

1. Usually this kind of framework is hard to generalize to other task domains. This typically involves composing a template carefully. The initial code template is good, but it also injects too much inductive bias.
2. The framework is highly dependent on the MLLM's capabilities. Did you perform any ablation study on how the choices and settings of MLLMs could affect the overall performance?
3. Lunar Lander and Car Racing are important RL tasks, yet they are too simple. Did you ever apply this fr

Taxonomy

Topics: Reinforcement Learning in Robotics · Multimodal Machine Learning Applications · Topic Modeling