Multimodal embodiment-aware navigation transformer
Louis Dezons, Quentin Picard, Rémi Marsal, François Goulette, David Filliat

TL;DR
ViLiNT is a multimodal, transformer-based navigation model that enhances ground robot robustness across environments by fusing visual, geometric, and embodiment data, and predicting collision-free trajectories.
Contribution
The paper introduces ViLiNT, a novel multimodal transformer architecture with embodiment-aware trajectory generation and ranking, improving zero-shot navigation robustness over state-of-the-art methods.
Findings
ViLiNT increases success rate by 166% over vision-only baselines.
It fuses RGB images, 3D LiDAR point clouds, a goal embedding, and a robot embodiment descriptor.
Real-world tests confirm improved obstacle avoidance in diverse environments.
Abstract
Goal-conditioned navigation models for ground robots trained using supervised learning show promising zero-shot transfer, but their collision-avoidance capability nevertheless degrades under distribution shift, i.e., changes in environment, robot, or sensor configuration. We propose ViLiNT, a multimodal, attention-based policy for goal navigation, trained on heterogeneous data from multiple platforms and environments, which improves robustness with two key features. First, we fuse RGB images, 3D LiDAR point clouds, a goal embedding, and a robot's embodiment descriptor with a transformer architecture to capture complementary geometry and appearance cues. The transformer's output is used to condition a diffusion model that generates navigable trajectories. Second, using automatically generated offline labels, we train a path clearance prediction head for scoring and ranking trajectories…
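The score-and-rank step described in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the paper trains a learned path clearance head, whereas this example swaps in a plain geometric clearance (the minimum distance from any waypoint to any obstacle point) as a stand-in score. The names `path_clearance`, `rank_trajectories`, and `robot_radius` are hypothetical, and reducing the embodiment descriptor to a single radius is an assumption made only for illustration.

```python
import math

def path_clearance(trajectory, obstacles):
    """Geometric stand-in for a learned clearance score (assumption):
    the smallest distance from any waypoint to any obstacle point."""
    return min(math.dist(p, o) for p in trajectory for o in obstacles)

def rank_trajectories(trajectories, obstacles, robot_radius):
    """Return (index, clearance) pairs sorted safest first.

    Candidates whose clearance does not exceed robot_radius are dropped;
    robot_radius is a hypothetical one-number embodiment descriptor.
    """
    scored = [(i, path_clearance(t, obstacles)) for i, t in enumerate(trajectories)]
    feasible = [(i, c) for i, c in scored if c > robot_radius]
    return sorted(feasible, key=lambda pair: pair[1], reverse=True)

# Obstacle points, e.g. a 2D projection of LiDAR returns (made-up values).
obstacles = [(2.0, 0.0), (3.0, 0.5)]

# Candidate trajectories, e.g. samples from a generative model (made-up values).
candidates = [
    [(0.0, 0.0), (2.0, 0.1), (4.0, 0.2)],    # passes 0.1 m from an obstacle
    [(0.0, 0.0), (2.0, 1.0), (4.0, 2.0)],    # swerves to one side
    [(0.0, 0.0), (2.0, -1.5), (4.0, -3.0)],  # wide detour to the other side
]

ranking = rank_trajectories(candidates, obstacles, robot_radius=0.3)
print([index for index, _ in ranking])  # → [2, 1]
```

With a larger hypothetical `robot_radius` of 1.2, only candidate 2 survives the filter, which illustrates why a ranking that accounts for the robot's embodiment can select different paths for different platforms.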
