Task adaptation of Vision-Language-Action model: 1st Place Solution for the 2025 BEHAVIOR Challenge
Ilia Larchenko, Gleb Zarin, Akash Karnatak

TL;DR
This paper presents a vision-action policy for long-horizon household tasks that took 1st place in the 2025 BEHAVIOR Challenge. It builds on the Pi0.5 architecture and adds correlated noise for flow matching, along with several training and inference techniques.
Contribution
It introduces correlated noise for flow matching and correlation-aware inpainting for vision-language-action models on long-horizon tasks.
Findings
Achieved a 26% q-score across all 50 tasks on both the public and private leaderboards
Introduced correlated noise for improved training efficiency
Implemented correlation-aware inpainting for smoother actions
Abstract
We present a vision-action policy that won 1st place in the 2025 BEHAVIOR Challenge, a large-scale benchmark featuring 50 diverse long-horizon household tasks in photo-realistic simulation, requiring bimanual manipulation, navigation, and context-aware decision making. Building on the Pi0.5 architecture, we introduce several innovations. Our primary contribution is correlated noise for flow matching, which improves training efficiency and enables correlation-aware inpainting for smooth action sequences. We also apply learnable mixed-layer attention and System 2 stage tracking for ambiguity resolution. Training employs multi-sample flow matching to reduce variance, while inference uses action compression and challenge-specific correction rules. Our approach achieves a 26% q-score across all 50 tasks on both the public and private leaderboards.
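To make the idea of correlated noise more concrete, here is a minimal NumPy sketch. It is not the authors' implementation, and the abstract does not give these details. The sketch assumes that the noise covariance is estimated from flattened training action chunks and blended with the identity (the `beta` parameter is hypothetical), that flow matching uses a linear interpolation path with target velocity `noise - actions`, and that inpainting propagates a correction on the known action dimensions to the free ones through the Gaussian conditional mean. All function names are illustrative.

```python
import numpy as np


def estimate_noise_cov(actions, beta=0.5):
    """Blend the empirical action covariance with the identity.

    actions: (N, D) array of flattened action chunks.
    beta is a hypothetical shrinkage weight; beta < 1 keeps the result
    positive definite so a Cholesky factor exists.
    """
    sigma = np.cov(actions, rowvar=False)
    return beta * sigma + (1.0 - beta) * np.eye(sigma.shape[0])


def sample_correlated_noise(cov, n, rng):
    """Draw n samples from N(0, cov) using a Cholesky factor."""
    chol = np.linalg.cholesky(cov)
    z = rng.standard_normal((n, cov.shape[0]))
    return z @ chol.T


def flow_matching_pair(actions, noise, t):
    """Build network inputs and regression targets for flow matching.

    Assumed convention: x_t = t * noise + (1 - t) * actions, with
    target velocity noise - actions. t has shape (N,).
    """
    t = t[:, None]
    x_t = t * noise + (1.0 - t) * actions
    return x_t, noise - actions


def propagate_correction(cov, known_idx, free_idx, delta_known):
    """Shift free dimensions consistently with a change on known ones.

    Uses the Gaussian conditional mean: Sigma_fk @ inv(Sigma_kk) @ delta.
    """
    s_kk = cov[np.ix_(known_idx, known_idx)]
    s_fk = cov[np.ix_(free_idx, known_idx)]
    return s_fk @ np.linalg.solve(s_kk, delta_known)


# Synthetic smooth "action chunks" stand in for real demonstrations.
rng = np.random.default_rng(0)
demo_actions = np.cumsum(0.5 * rng.standard_normal((2000, 6)), axis=1)
demo_cov = estimate_noise_cov(demo_actions)
demo_noise = sample_correlated_noise(demo_cov, len(demo_actions), rng)
demo_x_t, demo_target = flow_matching_pair(
    demo_actions, demo_noise, rng.uniform(size=len(demo_actions))
)
```

In this sketch, neighbouring action dimensions that move together in the data also move together in the noise. This is why a correction applied to already-executed actions can be spread smoothly over the remaining ones instead of creating a discontinuity at the boundary.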
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Robot Manipulation and Learning
