Robo-Instruct: Simulator-Augmented Instruction Alignment For Finetuning Code LLMs
Zichao Hu, Junyi Jessy Li, Arjun Guha, Joydeep Biswas

TL;DR
Robo-Instruct introduces a method to fine-tune small code LLMs for robot tasks by dynamically creating simulation environments and refining instructions, improving task-program alignment and performance.
Contribution
The paper presents Robo-Instruct, a novel approach that synthesizes simulation environments during execution and refines instructions, enabling effective fine-tuning of small code LLMs for robotics tasks.
Findings
Models fine-tuned with Robo-Instruct outperform baseline methods.
The fine-tuned models match or surpass larger proprietary models.
Simulation environments are synthesized effectively during task execution.
Abstract
Code LLMs have shown promising results in converting tasks in natural language to programs that can be executed by service robots. We are interested in finetuning small, specialized LLMs for this purpose, but collecting datasets of task-program pairs specific to each robot is time-consuming and expensive. While approaches such as SELF-INSTRUCT and EVOL-INSTRUCT are capable of generating novel tasks given a few examples, they are unable to provide the corresponding programs that correctly abide by physical-world and robot constraints using the provided programming interface. Using a simulator is a natural potential solution for checking such constraints, but building simulation environments that can handle arbitrary tasks and their necessary objects and locations is challenging. To address these challenges, we introduce ROBO-INSTRUCT, which synthesizes task-specific simulation…
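The idea of synthesizing a simulation environment while a candidate program runs can be illustrated with a minimal sketch. This is not the authors' ROBOSIM implementation: the class `LazyWorld`, the robot calls `go_to`, `is_in_room`, `pick`, and `place`, the two constraints checked, and the fixed number of trials are all hypothetical choices made for illustration. The sketch only shows the general pattern of deciding world state the first time a program asks about it, and rejecting programs that break a constraint in any sampled world.

```python
import random


class ConstraintViolation(Exception):
    """Raised when a program breaks a (hypothetical) robot or world constraint."""


class LazyWorld:
    """Toy simulator whose state is filled in only when a program queries it."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.location = "start"
        self.known_rooms = {"start"}
        self.objects_here = {}  # (room, obj) -> bool, decided on first query
        self.holding = None

    def go_to(self, room):
        # A room is added to the world the first time a program references it.
        self.known_rooms.add(room)
        self.location = room

    def is_in_room(self, obj):
        # Object presence is sampled once, then stays consistent for this run.
        key = (self.location, obj)
        if key not in self.objects_here:
            self.objects_here[key] = self.rng.random() < 0.5
        return self.objects_here[key]

    def pick(self, obj):
        if self.holding is not None:
            raise ConstraintViolation("robot is already holding an object")
        if not self.is_in_room(obj):
            raise ConstraintViolation(f"{obj} is not in {self.location}")
        self.holding = obj
        self.objects_here[(self.location, obj)] = False

    def place(self, obj):
        if self.holding != obj:
            raise ConstraintViolation(f"robot is not holding {obj}")
        self.holding = None
        self.objects_here[(self.location, obj)] = True


def passes_check(program, trials=8):
    """Run the program in several sampled worlds; reject on any violation."""
    for seed in range(trials):
        try:
            program(LazyWorld(seed))
        except ConstraintViolation:
            return False
    return True


def good_program(robot):
    robot.go_to("kitchen")
    if robot.is_in_room("mug"):  # checks before acting
        robot.pick("mug")
        robot.go_to("office")
        robot.place("mug")


def bad_program(robot):
    robot.go_to("kitchen")
    robot.pick("mug")  # never checks that the mug is present
    robot.pick("cup")  # picks a second object while still holding one
```

Under these assumptions, `passes_check(good_program)` accepts the first program and `passes_check(bad_program)` rejects the second, so only the first task-program pair would be kept for a fine-tuning set.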
Peer Reviews
Decision: Submitted to ICLR 2025
Clarity. The authors did a phenomenal job describing their method and experimental process with precision. The specifics of the robosim environments were easy to follow from the method section and the motivation for dynamic environment generation and its unique application to robotic service agents was well presented. Quality. The approach is simple and sound and I believe there is sufficient information for researchers to reproduce the results. On the RoboEval benchmark their method produces a
Presentation. Focusing solely on open models as being error-prone unnecessarily limits the scope and potential impact of the paper's contributions when it could be relevant to any base LLM. Quality. There was a limited diversity of baselines. The domain specific language looks very similar to the code-as-policies test environments. In the experimental section, the authors should contextualize the results by either explaining the best analogy to code-as-policies that they run or by explicitly di
Clarity: The framework's purpose, components, and experimental results are presented clearly, though some complex aspects could benefit from additional clarification (e.g., the alignment between ROBOSIM and traditional STRIPS planning). The experiment design is well-articulated, showing comparisons across multiple baselines and careful control of variables. Novelty: The integration of ROBOSIM for dynamic simulation and INSTALIGN for instruction alignment introduces a novel approach to overco
(1) The data augmentation approach seems somewhat incremental, given its widespread use in Evol-Instruct, WizardLM, and similar frameworks. It would be valuable to explore more unique challenges and solutions tailored to robotics, which often requires handling more complex tasks. Additionally, an evaluation on scaling performance regarding parameter count, generalization, and related metrics would strengthen the analysis. (2) Another concern is that the evaluated tasks in the paper appear overl
1. Applying code to embodied AI is an important direction, so exploring how to enhance the code generation capabilities of LLMs in the robotics domain is also meaningful. 2. The idea of guiding the synthesized data with verification of the ROBOSIM environment is reasonable. 3. The experimental result looks promising.
I'd be happy to raise my score if my concerns are addressed. 1. There are many instruction tuning methods for code LLMs, but this paper only compares with SELF-INSTRUCT. Could the authors compare with methods like Evol-Instruct in WizardCoder as well? 2. The choice of settings is somewhat confusing. For example, why use the Llama3 model to generate the training set and then fine-tune the 7B CodeLlama instead of Llama3 itself? Why not use more powerful closed-source models like GPT-3.5 or GPT-4-T
Taxonomy
Topics: Model-Driven Software Engineering Techniques · Software Testing and Debugging Techniques · Natural Language Processing Techniques
Methods: Attention Is All You Need · Balanced Selection · Adam · Dropout · Dense Connections · Softmax
