Task adaptation of Vision-Language-Action model: 1st Place Solution for the 2025 BEHAVIOR Challenge

Ilia Larchenko; Gleb Zarin; Akash Karnatak

arXiv:2512.06951·cs.RO·December 23, 2025

TL;DR

This paper presents a vision-action policy, built on the Pi0.5 architecture, that won 1st place in the 2025 BEHAVIOR Challenge, a benchmark of 50 long-horizon household tasks. It combines correlated noise for flow matching with challenge-specific training and inference techniques.

Contribution

It introduces correlated noise for flow matching and the correlation-aware inpainting that it enables, alongside learnable mixed-layer attention, System 2 stage tracking, and multi-sample flow matching, for vision-language-action models on long-horizon tasks.

Findings

01. Achieved a 26% q-score across all 50 tasks, on both the public and private leaderboards

02. Introduced correlated noise for flow matching, which improves training efficiency

03. Implemented correlation-aware inpainting for smooth action sequences

Abstract

We present a vision-action policy that won 1st place in the 2025 BEHAVIOR Challenge - a large-scale benchmark featuring 50 diverse long-horizon household tasks in photo-realistic simulation, requiring bimanual manipulation, navigation, and context-aware decision making. Building on the Pi0.5 architecture, we introduce several innovations. Our primary contribution is correlated noise for flow matching, which improves training efficiency and enables correlation-aware inpainting for smooth action sequences. We also apply learnable mixed-layer attention and System 2 stage tracking for ambiguity resolution. Training employs multi-sample flow matching to reduce variance, while inference uses action compression and challenge-specific correction rules. Our approach achieves 26% q-score across all 50 tasks on both public and private leaderboards.
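
To make the idea of correlated noise for flow matching more concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes the noise covariance is a blend of the empirical covariance of flattened action chunks and the identity, with a hypothetical mixing weight `beta`, and it assumes the linear path x_t = (1 - t) * noise + t * actions. All function names are illustrative. The last helper shows the standard Gaussian conditional mean, which is one way a correlated prior could propagate a fixed action prefix to the remaining dimensions; whether the paper's correlation-aware inpainting works this way is an assumption.

```python
import numpy as np


def shrunk_covariance(actions, beta=0.5):
    """Blend the empirical covariance of action chunks with the identity.

    actions: array of shape (num_chunks, dim), each row a flattened chunk.
    beta: hypothetical mixing weight (assumption, not taken from the paper).
    """
    centered = actions - actions.mean(axis=0, keepdims=True)
    empirical = centered.T @ centered / max(len(actions) - 1, 1)
    return beta * empirical + (1.0 - beta) * np.eye(actions.shape[1])


def sample_correlated_noise(cov, num_samples, rng):
    """Draw eps ~ N(0, cov) as eps = z @ L.T, where cov = L @ L.T."""
    chol = np.linalg.cholesky(cov)
    z = rng.standard_normal((num_samples, cov.shape[0]))
    return z @ chol.T


def flow_matching_pair(actions, noise, t):
    """Return the interpolated point x_t and the velocity regression target.

    Assumed convention: x_t = (1 - t) * noise + t * actions,
    so d x_t / d t = actions - noise.
    """
    t = t[:, None]
    x_t = (1.0 - t) * noise + t * actions
    target = actions - noise
    return x_t, target


def condition_on_known(cov, known_idx, known_values):
    """Gaussian conditional mean of the free dimensions given fixed ones.

    For eps ~ N(0, cov): E[free | known] = C_fk @ inv(C_kk) @ known.
    """
    free_idx = np.setdiff1d(np.arange(cov.shape[0]), known_idx)
    c_kk = cov[np.ix_(known_idx, known_idx)]
    c_fk = cov[np.ix_(free_idx, known_idx)]
    return free_idx, c_fk @ np.linalg.solve(c_kk, known_values)
```

With `beta` strictly below 1, the blended covariance stays positive definite, so the Cholesky factorization is well defined even when the empirical covariance is rank deficient.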

Code & Models

Datasets

IliaLarchenko/behavior_224_rgb
dataset · 2.8k downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Robot Manipulation and Learning