FRAPPE: Infusing World Modeling into Generalist Policies via Multiple Future Representation Alignment
Han Zhao, Jingbo Wang, Wenxuan Song, Shuai Chen, Yang Liu, Yan Wang, Haoang Li, Donglin Wang

TL;DR
FRAPPE is a two-stage fine-tuning approach that strengthens world modeling in generalist policies by aligning predicted future representations with those of multiple visual foundation models, improving efficiency and generalization in robotic tasks.
Contribution
The paper proposes FRAPPE, a new method that aligns future visual representations with multiple models, addressing over-emphasis on pixel reconstruction and reducing reliance on action data.
Findings
Outperforms state-of-the-art methods on the RoboTwin benchmark
Demonstrates strong generalization in long-horizon tasks
Reduces training time and data requirements
Abstract
Enabling VLA models to predict environmental dynamics, known as world modeling, has been recognized as essential for improving robotic reasoning and generalization. However, current approaches face two main issues: (1) the training objective forces models to over-emphasize pixel-level reconstruction, which constrains semantic learning and generalization; and (2) reliance on predicted future observations during inference often leads to error accumulation. To address these challenges, we introduce Future Representation Alignment via Parallel Progressive Expansion (FRAPPE). Our method adopts a two-stage fine-tuning strategy: in the mid-training phase, the model learns to predict the latent representations of future observations; in the post-training phase, we expand the computational workload in parallel and align the representation simultaneously with multiple different visual foundation models.…
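The alignment idea in the abstract can be illustrated with a small sketch. The abstract does not state the loss function, the teacher models, or how the parallel streams are built, so everything here is an assumption made for illustration: cosine distance stands in for the alignment objective, the teacher names are hypothetical, and each teacher is assumed to have its own predicted representation. This is not the authors' implementation.

```python
import math
from typing import Dict, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def alignment_loss(predicted: Vector, target: Vector) -> float:
    """Assumed objective: cosine distance between a predicted future
    representation and a frozen teacher's embedding of the real future frame."""
    return 1.0 - cosine_similarity(predicted, target)


def multi_teacher_alignment_loss(
    predictions: Dict[str, Vector], targets: Dict[str, Vector]
) -> float:
    """Average the alignment loss over several teachers.

    Both dicts are keyed by a (hypothetical) teacher name; each teacher is
    assumed to have its own predicted stream.
    """
    if not targets or predictions.keys() != targets.keys():
        raise ValueError("predictions and targets must share the same non-empty teacher set")
    losses = [alignment_loss(predictions[name], targets[name]) for name in targets]
    return sum(losses) / len(losses)


# Toy example with two hypothetical teachers and 2-D representations.
example_predictions = {"teacher_a": [1.0, 0.0], "teacher_b": [0.0, 2.0]}
example_targets = {"teacher_a": [2.0, 0.0], "teacher_b": [3.0, 0.0]}
example_loss = multi_teacher_alignment_loss(example_predictions, example_targets)
```

In the toy example, the first prediction points in the same direction as its target (loss 0) and the second is orthogonal to its target (loss 1), so the averaged loss is 0.5. Because the loss is computed in representation space, nothing in it depends on reconstructing pixels, which is the property the abstract emphasizes.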
Taxonomy
Topics: Multimodal Machine Learning Applications · Reinforcement Learning in Robotics · Generative Adversarial Networks and Image Synthesis
