ForeAct: Steering Your VLA with Efficient Visual Foresight Planning

Zhuoyang Zhang; Shang Yang; Qinghao Hu; Luke J. Huang; James Hou; Yufei Sun; Yao Lu; Song Han

arXiv:2602.12322 · cs.RO · February 16, 2026

Open Access

TL;DR

ForeAct introduces an efficient visual foresight planning approach that enhances vision-language-action models by predicting future observations to improve accuracy and generalization in open-world tasks.

Contribution

The paper presents a novel foresight planning module that predicts future observations, enabling better visuo-motor inference without architectural changes to existing VLAs.

Findings

01. Achieves 87.4% success rate on real-world tasks

02. Improves performance by over 40% compared to baseline

03. Foresight generator trained on 1 million episodes

Abstract

Vision-Language-Action (VLA) models convert high-level language instructions into concrete, executable actions, a task that is especially challenging in open-world environments. We present Visual Foresight Planning (ForeAct), a general and efficient planner that guides a VLA step-by-step using imagined future observations and subtask descriptions. With an imagined future observation, the VLA can focus on visuo-motor inference rather than high-level semantic reasoning, leading to improved accuracy and generalization. Our planner comprises a highly efficient foresight image generation module that predicts a high-quality 640 $\times$ 480 future observation from the current visual input and language instruction within only 0.33s on an H100 GPU, together with a vision-language model that reasons over the task and produces subtask descriptions for both the generator and the VLA. Importantly,…
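
The step-by-step guidance described in the abstract can be pictured as a simple control loop: a vision-language model proposes the next subtask, the foresight generator imagines the observation that should follow, and the VLA acts toward that imagined image. The sketch below is only an illustration under stated assumptions, not the authors' implementation. Every name in it (`foresight_control_loop`, `vlm_next_subtask`, `generate_foresight`, `vla_act`, `PlanStep`, `max_steps`) is hypothetical, and the models are passed in as plain callables so the loop stays self-contained.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class PlanStep:
    """One planner step: the subtask text and the imagined future observation."""
    subtask: str
    foresight: Any


def foresight_control_loop(
    instruction: str,
    observe: Callable[[], Any],
    vlm_next_subtask: Callable[[str, Any], Optional[str]],  # hypothetical VLM wrapper
    generate_foresight: Callable[[Any, str], Any],          # hypothetical image generator
    vla_act: Callable[[Any, Any, str], Any],                # hypothetical VLA policy
    execute: Callable[[Any], None],
    max_steps: int = 10,                                    # assumed safety cap
) -> List[PlanStep]:
    """Run the planner until the VLM reports no further subtask."""
    history: List[PlanStep] = []
    for _ in range(max_steps):
        obs = observe()
        # The VLM reasons over the task and names the next subtask (None = done).
        subtask = vlm_next_subtask(instruction, obs)
        if subtask is None:
            break
        # The generator imagines the observation after the subtask completes.
        goal_image = generate_foresight(obs, subtask)
        # The VLA only has to solve the visuo-motor step toward the imagined image.
        action = vla_act(obs, goal_image, subtask)
        execute(action)
        history.append(PlanStep(subtask=subtask, foresight=goal_image))
    return history
```

Passing the three models as callables mirrors the abstract's framing of the planner as a separate component that guides the VLA: in this sketch the policy only receives two extra inputs (the imagined image and the subtask text) and is otherwise unchanged.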

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning