LERa: Replanning with Visual Feedback in Instruction Following

Svyatoslav Pchelintsev; Maxim Patratskiy; Anatoly Onishchenko; Alexandr Korchemnyi; Aleksandr Medvedev; Uliana Vinogradova; Ilya Galuzinsky; Aleksey Postnikov; Alexey K. Kovalev; Aleksandr I. Panov

arXiv:2507.05135 · cs.RO · October 7, 2025

TL;DR

LERa is a visual feedback-based replanning method for robotics that improves task success rates by generating scene descriptions, explaining errors, and modifying plans using only RGB images and natural language instructions.

Contribution

LERa introduces a novel visual language model approach for replanning in robotics that requires minimal input and handles dynamic scene changes and failures effectively.

Findings

01. Achieves a 40% improvement over baselines in dynamic environments.

02. Increases success rates by up to 67% in simulated tabletop tasks.

03. Shown to be effective in real-world robot experiments.

Abstract

Large Language Models are increasingly used in robotics for task planning, but their reliance on textual inputs limits their adaptability to real-world changes and failures. To address these challenges, we propose LERa (Look, Explain, Replan), a Visual Language Model-based replanning approach that utilizes visual feedback. Unlike existing methods, LERa requires only a raw RGB image, a natural language instruction, an initial task plan, and failure detection, without additional information such as object detection or predefined conditions that may be unavailable in a given scenario. The replanning process consists of three steps: (i) Look, where LERa generates a scene description and identifies errors; (ii) Explain, where it provides corrective guidance; and (iii) Replan, where it modifies the plan accordingly. LERa is adaptable to various agent architectures and can handle errors…
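The three steps described in the abstract can be sketched as a short loop over a vision-language model. This is a minimal illustration, not the authors' implementation: the `vlm` callable, the `lera_replan` name, the `[LOOK]`/`[EXPLAIN]`/`[REPLAN]` prompt tags, the prompt wording, and the one-action-per-line plan format are all assumptions made for the example. The abstract only states that the inputs are an RGB image, a natural language instruction, an initial plan, and a failure signal.

```python
from typing import Callable, List

# Hypothetical VLM interface (an assumption): takes an RGB image object and a
# text prompt, and returns the model's text response.
VLM = Callable[[object, str], str]


def lera_replan(
    vlm: VLM,
    image: object,
    instruction: str,
    plan: List[str],
    failed_step: int,
) -> List[str]:
    """Sketch of Look -> Explain -> Replan after a failure at `failed_step`.

    Prompt texts and the line-based plan format are illustrative assumptions.
    """
    # Only the part of the plan from the failed step onward is reconsidered.
    plan_text = "\n".join(plan[failed_step:])

    # Look: describe the scene from the image and identify the error.
    scene = vlm(
        image,
        f"[LOOK]\nInstruction: {instruction}\nRemaining plan:\n{plan_text}\n"
        "Describe the scene and identify the error.",
    )

    # Explain: ask for corrective guidance given the scene description.
    guidance = vlm(
        image,
        f"[EXPLAIN]\nScene and error: {scene}\nExplain how to correct it.",
    )

    # Replan: ask for a modified plan, one action per line.
    raw_plan = vlm(
        image,
        f"[REPLAN]\nInstruction: {instruction}\nRemaining plan:\n{plan_text}\n"
        f"Guidance: {guidance}\nReturn the corrected plan, one action per line.",
    )

    # Parse the response into a list of actions, dropping blank lines.
    return [line.strip() for line in raw_plan.splitlines() if line.strip()]
```

In this sketch the function is called only after an external failure detector has fired, which matches the abstract's statement that failure detection is an input rather than something the method computes. Passing the image to all three calls is a design choice of the example; the source does not say which steps receive visual input.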

Taxonomy

Topics: Education and Technology Integration · Visual and Cognitive Learning Processes