LATTE: LAnguage Trajectory TransformEr
Arthur Bucker, Luis Figueredo, Sami Haddadin, Ashish Kapoor, Shuang Ma, Sai Vemprala, Rogerio Bonatti

TL;DR
LATTE introduces a transformer-based framework that interprets natural language and scene images to generate adaptable robotic trajectories across various robot types and environments, enhancing flexibility and generalizability.
Contribution
It extends previous work by adding a 3D and velocity trajectory parametrization, using real scene images for context, and applying the approach to diverse robot platforms beyond manipulation.
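As a rough illustration of what a trajectory with 3D position and velocity could look like, here is a minimal sketch in plain Python. The `Waypoint` layout, the `scale_speed` helper, and the `duration` estimate are all hypothetical names chosen for this example; they are not taken from the paper or its code and only show the kind of data a "shape and speed" modification would act on.

```python
from dataclasses import dataclass
from math import dist


@dataclass
class Waypoint:
    # Hypothetical layout: 3D position (metres) plus a scalar speed (m/s).
    x: float
    y: float
    z: float
    v: float


def scale_speed(traj, factor, start=0, end=None):
    """Return a copy of traj with speeds in [start, end) multiplied by factor."""
    end = len(traj) if end is None else end
    return [
        Waypoint(w.x, w.y, w.z, w.v * factor) if start <= i < end else w
        for i, w in enumerate(traj)
    ]


def duration(traj):
    """Approximate traversal time: segment length over mean endpoint speed."""
    total = 0.0
    for a, b in zip(traj, traj[1:]):
        length = dist((a.x, a.y, a.z), (b.x, b.y, b.z))
        total += length / ((a.v + b.v) / 2)
    return total


# Three waypoints forming an L-shaped path, each traversed at 1 m/s.
traj = [
    Waypoint(0.0, 0.0, 0.0, 1.0),
    Waypoint(1.0, 0.0, 0.0, 1.0),
    Waypoint(1.0, 1.0, 0.0, 1.0),
]

# An instruction such as "go slower" could map to a speed scaling like this.
slow = scale_speed(traj, 0.5)
```

In this toy setup, halving every speed doubles the estimated traversal time while leaving the path shape unchanged, which is the separation between shape and speed that a velocity-aware parametrization makes possible.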
Findings
Successfully follows human intent in diverse scenarios
Adapts trajectory shape and speed effectively
Demonstrates applicability to aerial and legged robots
Abstract
Natural language is one of the most intuitive ways to express human intent. However, translating instructions and commands into robotic motion generation and deployment in the real world is far from an easy task. The challenge of combining a robot's inherent low-level geometric and kinodynamic constraints with a human's high-level semantic instructions is traditionally solved using task-specific solutions with little generalizability between hardware platforms, often with the use of static sets of target actions and commands. This work instead proposes a flexible language-based framework that allows a user to modify generic robotic trajectories. Our method leverages pre-trained language models (BERT and CLIP) to encode the user's intent and target objects directly from a free-form text input and scene images, fuses geometrical features generated by a transformer encoder…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Natural Language Processing Techniques
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
