FRAPPE: Infusing World Modeling into Generalist Policies via Multiple Future Representation Alignment

Han Zhao; Jingbo Wang; Wenxuan Song; Shuai Chen; Yang Liu; Yan Wang; Haoang Li; Donglin Wang

arXiv:2602.17259 · cs.RO · February 20, 2026

TL;DR

FRAPPE introduces a novel two-stage fine-tuning approach that enhances world modeling in generalist policies by aligning future representations with multiple visual models, improving efficiency and generalization in robotic tasks.

Contribution

The paper proposes FRAPPE, a method that aligns predicted future visual representations with multiple visual foundation models, addressing over-emphasis on pixel reconstruction and reducing reliance on action data.

Findings

01. Outperforms state-of-the-art methods on the RoboTwin benchmark

02. Demonstrates strong generalization in long-horizon tasks

03. Reduces training time and data requirements

Abstract

Enabling VLA models to predict environmental dynamics, known as world modeling, has been recognized as essential for improving robotic reasoning and generalization. However, current approaches face two main issues: (1) the training objective forces models to over-emphasize pixel-level reconstruction, which constrains semantic learning and generalization; and (2) reliance on predicted future observations during inference often leads to error accumulation. To address these challenges, we introduce Future Representation Alignment via Parallel Progressive Expansion (FRAPPE). Our method adopts a two-stage fine-tuning strategy: in the mid-training phase, the model learns to predict the latent representations of future observations; in the post-training phase, we expand the computational workload in parallel and align the representation simultaneously with multiple different visual foundation models.…
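To make the idea of aligning one predicted future representation with several visual foundation models concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the alignment loss, so cosine distance is assumed; the teacher names `teacher_a` and `teacher_b`, the function names, and the one-prediction-per-teacher layout are all hypothetical.

```python
import math
from typing import Dict, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def multi_teacher_alignment_loss(
    predicted: Dict[str, Vector],
    targets: Dict[str, Vector],
) -> float:
    """Mean cosine distance (1 - cosine similarity) across all teachers.

    predicted[name]: the policy's predicted future representation, assumed to be
                     already projected into teacher `name`'s embedding space.
    targets[name]:   the frozen teacher's embedding of the real future observation.
    Both the per-teacher projection and the cosine-distance objective are
    assumptions made for this sketch.
    """
    if not targets:
        raise ValueError("at least one teacher is required")
    if predicted.keys() != targets.keys():
        raise ValueError("predicted and targets must cover the same teachers")
    losses = [
        1.0 - cosine_similarity(predicted[name], targets[name])
        for name in targets
    ]
    return sum(losses) / len(losses)


# Hypothetical two-teacher example with tiny 2-D embeddings.
predicted = {"teacher_a": [1.0, 0.0], "teacher_b": [0.0, 1.0]}
targets = {"teacher_a": [1.0, 0.0], "teacher_b": [1.0, 0.0]}
example_loss = multi_teacher_alignment_loss(predicted, targets)
```

In this toy case the prediction matches `teacher_a` exactly (distance 0) and is orthogonal to `teacher_b` (distance 1), so the averaged loss is 0.5. Because the targets are latent embeddings rather than pixels, a loss of this shape does not reward pixel-level reconstruction, which is the first issue the abstract raises.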

Taxonomy

Topics: Multimodal Machine Learning Applications · Reinforcement Learning in Robotics · Generative Adversarial Networks and Image Synthesis