DiffusionAnything: End-to-End In-context Diffusion Learning for Unified Navigation and Pre-Grasp Motion

Iana Zhura; Yara Mahmoud; Jeffrin Sam; Hung Khang Nguyen; Didar Seyidov; Miguel Altamirano Cabrera; Dzmitry Tsetserukou

arXiv:2603.26322·cs.RO·March 30, 2026

TL;DR

DiffusionAnything introduces a unified, end-to-end diffusion model for robotic navigation and manipulation that takes only RGB images as input, trains on minimal self-supervised data, and generalizes zero-shot to new environments.

Contribution

The paper presents a novel diffusion-based framework with multi-scale feature modulation for unified navigation and manipulation, requiring only 5 minutes of self-supervised data per task.

Findings

01

Achieves robust zero-shot generalization to unseen scenes.

02

Operates at 10 Hz using only RGB input.

03

Requires minimal self-supervised data (5 minutes per task).

Abstract

Efficiently predicting motion plans directly from vision remains a fundamental challenge in robotics, where planning typically requires explicit goal specification and task-specific design. Recent vision-language-action (VLA) models infer actions directly from visual input, but they demand massive computational resources and extensive training data, and they fail zero-shot in novel scenes. We present a unified image-space diffusion policy that handles both meter-scale navigation and centimeter-scale manipulation via multi-scale feature modulation, with only 5 minutes of self-supervised data per task. Three key innovations drive the framework: (1) multi-scale FiLM conditioning on task mode, depth scale, and spatial attention enables task-appropriate behavior in a single model; (2) trajectory-aligned depth prediction focuses metric 3D reasoning along generated waypoints; (3) self-supervised attention from…
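The FiLM conditioning named in the abstract can be illustrated with a generic sketch. FiLM (feature-wise linear modulation) scales and shifts each channel of a feature map using values predicted from a conditioning vector. The code below is not the paper's implementation: the function names `film` and `make_condition`, the one-hot task encoding, the feature-map shapes, and the random projection weights are all assumptions made for illustration, and the spatial-attention input mentioned in the abstract is left out.

```python
import numpy as np

def film(features, cond, w_gamma, w_beta):
    """Apply feature-wise linear modulation to a (C, H, W) feature map.

    cond is a (D,) conditioning vector; w_gamma and w_beta are (C, D)
    projections (hypothetical stand-ins for learned layers).
    """
    gamma = 1.0 + w_gamma @ cond  # per-channel scale; identity if the projection is zero
    beta = w_beta @ cond          # per-channel shift
    return gamma[:, None, None] * features + beta[:, None, None]

def make_condition(task_mode, depth_scale):
    # Assumed encoding: one-hot task mode (0 = navigation, 1 = manipulation)
    # followed by a scalar depth scale.
    one_hot = np.eye(2)[task_mode]
    return np.concatenate([one_hot, [depth_scale]])

rng = np.random.default_rng(0)

# Feature maps at two resolutions, standing in for a multi-scale backbone.
scales = [rng.normal(size=(8, 16, 16)), rng.normal(size=(16, 8, 8))]
cond = make_condition(task_mode=1, depth_scale=0.01)

modulated = []
for feats in scales:
    channels = feats.shape[0]
    # Each scale gets its own projection, so the same condition can
    # modulate coarse and fine features differently.
    w_g = rng.normal(scale=0.1, size=(channels, cond.size))
    w_b = rng.normal(scale=0.1, size=(channels, cond.size))
    modulated.append(film(feats, cond, w_g, w_b))
```

Because the modulation is applied per channel and broadcast over the spatial dimensions, each output keeps the shape of its input feature map, which is what lets one shared network switch behavior from the conditioning vector alone.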
