MiniVLA-Nav v1: A Multi-Scene Simulation Dataset for Language-Conditioned Robot Navigation
Ali Al-Bustami, Jaerock Kwon

TL;DR
MiniVLA-Nav v1 is a simulation dataset for evaluating language-conditioned robot navigation: 1,174 episodes across four Isaac Sim environments, each with synchronized RGB, depth, and segmentation data, plus multiple evaluation splits.
Contribution
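The paper introduces MiniVLA-Nav v1, a new multi-scene simulation dataset for language-guided robot navigation that pairs each instruction with RGB, depth, and segmentation data and expert action labels, and defines in-distribution, template-paraphrase, and out-of-distribution (OOD) object-category benchmarks.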
Findings
The dataset contains 1,174 episodes, each with synchronized 640x640 RGB images, metric depth maps, and instance segmentation masks.
It covers four Isaac Sim environments, 12 object categories, and multiple paraphrase templates for robust evaluation.
It supports in-distribution, template-paraphrase, and out-of-distribution (OOD) object-category benchmarking.
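As a rough illustration of how the three benchmarks listed above could be separated, the following Python sketch routes an episode to a split based on its target category and instruction template. The field names (`category`, `template_id`), the split labels, and the example category names are hypothetical placeholders, not the dataset's actual schema or split definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    episode_id: int
    category: str     # target object category (hypothetical field name)
    template_id: int  # instruction template index (hypothetical field name)


def assign_split(ep, ood_categories, paraphrase_templates):
    """Return the benchmark an episode would fall into under this sketch.

    Held-out categories take priority over held-out templates, so an episode
    with an unseen object is always counted as OOD (an assumed convention).
    """
    if ep.category in ood_categories:
        return "ood_category"
    if ep.template_id in paraphrase_templates:
        return "template_paraphrase"
    return "in_distribution"


# Placeholder episodes; the category names and template ids are made up.
episodes = [
    Episode(0, "chair", 0),
    Episode(1, "chair", 20),
    Episode(2, "forklift", 3),
]
splits = [assign_split(e, {"forklift"}, {20}) for e in episodes]
print(splits)
```

Giving the OOD check priority keeps the three benchmarks disjoint, which makes per-split success rates easier to compare.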
Abstract
We present MiniVLA-Nav v1, a simulation dataset for Language-Conditioned Object Approach (LCOA) navigation: given a short natural-language instruction, an NVIDIA Nova Carter differential-drive robot must navigate to the named object and stop within 1 m across four photorealistic Isaac Sim environments (Office, Hospital, Full Warehouse, and Warehouse with Multiple Shelves). Each of the 1,174 episodes pairs an instruction with synchronized 640x640 RGB images, metric depth maps (float32, metres), and instance segmentation masks, together with continuous (v,omega) and 7x7 tokenized expert action labels recorded at 60 Hz from a vision-based proportional controller. Trajectory diversity is ensured through three spawn-distance tiers (near: 1.5-3.5 m, mid: 3.5-7.0 m, far: global curated points; Pearson r=0.94 between spawn distance and trajectory length), 12 object categories, 18 training…
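The abstract mentions continuous (v, omega) actions alongside 7x7 tokenized labels and a 1 m stopping criterion. One plausible reading, sketched below in Python, is a uniform 7-bin discretization of each velocity component combined into a single token in [0, 48]. The velocity ranges, the uniform binning, the row-major token layout, and the helper names are all assumptions made for illustration; the paper's actual tokenization may differ.

```python
import math

N_BINS = 7                   # 7 bins per axis, as in the "7x7" labels
V_RANGE = (0.0, 1.0)         # linear velocity range in m/s (ASSUMED)
OMEGA_RANGE = (-1.0, 1.0)    # angular velocity range in rad/s (ASSUMED)


def _bin_index(value, lo, hi):
    """Clip value to [lo, hi] and map it to a uniform bin index in 0..6."""
    clipped = min(max(value, lo), hi)
    idx = int((clipped - lo) / (hi - lo) * N_BINS)
    return min(idx, N_BINS - 1)  # the upper edge belongs to the last bin


def tokenize(v, omega):
    """Map a continuous (v, omega) pair to one token in 0..48 (assumed layout)."""
    return _bin_index(v, *V_RANGE) * N_BINS + _bin_index(omega, *OMEGA_RANGE)


def detokenize(token):
    """Recover the bin-centre (v, omega) for a token."""
    v_idx, w_idx = divmod(token, N_BINS)

    def centre(i, lo, hi):
        return lo + (i + 0.5) * (hi - lo) / N_BINS

    return centre(v_idx, *V_RANGE), centre(w_idx, *OMEGA_RANGE)


def within_success_radius(robot_xy, target_xy, radius_m=1.0):
    """Check only the distance part of the 'stop within 1 m' criterion."""
    dx = robot_xy[0] - target_xy[0]
    dy = robot_xy[1] - target_xy[1]
    return math.hypot(dx, dy) <= radius_m


token = tokenize(0.5, 0.0)
v_centre, omega_centre = detokenize(token)
print(token, v_centre, omega_centre)
```

Decoding to bin centres bounds the round-trip error at half a bin width per axis; a learned policy that emits tokens would be decoded the same way before commands are sent to the differential-drive base.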
