Multimodal embodiment-aware navigation transformer

Louis Dezons; Quentin Picard; Rémi Marsal; François Goulette; David Filliat

arXiv:2604.19267 · cs.RO · April 22, 2026

TL;DR

ViLiNT is a multimodal, transformer-based navigation model that enhances ground robot robustness across environments by fusing visual, geometric, and embodiment data, and predicting collision-free trajectories.

Contribution

The paper introduces ViLiNT, a novel multimodal transformer architecture with embodiment-aware trajectory generation and ranking, improving zero-shot navigation robustness over state-of-the-art methods.

Findings

01

ViLiNT increases success rate by 166% over vision-only baselines.

02

It effectively fuses RGB images, LiDAR, goal embeddings, and embodiment info.

03

Real-world tests confirm improved obstacle avoidance in diverse environments.

Abstract

Goal-conditioned navigation models for ground robots trained with supervised learning show promising zero-shot transfer, but their collision-avoidance capability degrades under distribution shift, i.e., changes in environment, robot, or sensor configuration. We propose ViLiNT, a multimodal, attention-based policy for goal navigation, trained on heterogeneous data from multiple platforms and environments, which improves robustness with two key features. First, we fuse RGB images, 3D LiDAR point clouds, a goal embedding, and a robot's embodiment descriptor with a transformer architecture to capture complementary geometry and appearance cues. The transformer's output is used to condition a diffusion model that generates navigable trajectories. Second, using automatically generated offline labels, we train a path clearance prediction head for scoring and ranking trajectories…
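
To make the scoring-and-ranking idea concrete, here is a minimal sketch under strong assumptions; it is not the paper's method. In ViLiNT the clearance score comes from a learned prediction head. The sketch swaps in a purely geometric stand-in: the smallest gap between a circular robot footprint and any 2D obstacle point along a candidate trajectory. The function names (`path_clearance`, `rank_by_clearance`), the reduction of the embodiment descriptor to a single `robot_radius`, and the toy data are all hypothetical.

```python
import math

def path_clearance(trajectory, obstacles, robot_radius):
    """Smallest gap (in meters) between a circular robot footprint and any
    obstacle point along the trajectory. Negative values mean overlap.
    Assumption: a geometric stand-in for a learned clearance head."""
    return min(
        math.dist(waypoint, obstacle) - robot_radius
        for waypoint in trajectory
        for obstacle in obstacles
    )

def rank_by_clearance(trajectories, obstacles, robot_radius):
    """Return (score, trajectory) pairs sorted from safest to least safe."""
    scored = [(path_clearance(t, obstacles, robot_radius), t) for t in trajectories]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Toy scene: two obstacle points (e.g. projected LiDAR returns) and two
# candidate trajectories, each a list of (x, y) waypoints.
obstacles = [(1.0, 0.0), (2.0, 0.5)]
candidates = [
    [(0.0, 0.0), (1.0, 0.2), (2.0, 0.4)],  # passes through the obstacles
    [(0.0, 0.0), (1.0, 1.5), (2.0, 2.0)],  # swings wide around them
]

ranked = rank_by_clearance(candidates, obstacles, robot_radius=0.3)
best_score, best_trajectory = ranked[0]
```

In this toy scene the wide trajectory ranks first with positive clearance, while the direct one gets a negative score because the footprint overlaps an obstacle. A larger `robot_radius` lowers every score, which is one simple way an embodiment parameter can change which trajectories count as safe.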
