LATTE: LAnguage Trajectory TransformEr

Arthur Bucker; Luis Figueredo; Sami Haddadin; Ashish Kapoor; Shuang Ma; Sai Vemprala; Rogerio Bonatti

arXiv:2208.02918 · cs.RO · September 20, 2022 · 1 citation

TL;DR

LATTE introduces a transformer-based framework that interprets natural language and scene images to generate adaptable robotic trajectories across various robot types and environments, enhancing flexibility and generalizability.

Contribution

It extends previous work in three ways: it adds 3D and velocity trajectory parametrization, uses real scene images for context, and applies the method to diverse robot platforms beyond manipulation.
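
As an illustration only of what a 3D-plus-velocity trajectory parametrization can look like, the sketch below stores each waypoint as (x, y, z, v) and applies a hand-written speed change of the kind a command such as "go slower" might request. The `Waypoint` type and the `scale_speed` helper are hypothetical names introduced for this example and are not taken from the paper; LATTE itself produces the modified trajectory with a transformer model rather than a fixed rule like this one.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Waypoint:
    """One trajectory sample: 3D position plus a scalar speed (assumed layout)."""
    x: float
    y: float
    z: float
    v: float


def scale_speed(traj: List[Waypoint], factor: float) -> List[Waypoint]:
    """Return a copy of the trajectory with every speed multiplied by `factor`.

    Positions are left unchanged, so only the velocity channel is modified.
    """
    return [Waypoint(w.x, w.y, w.z, w.v * factor) for w in traj]


# A straight-line trajectory at constant height and constant speed.
traj = [
    Waypoint(0.0, 0.0, 0.5, 1.0),
    Waypoint(0.5, 0.0, 0.5, 1.0),
    Waypoint(1.0, 0.0, 0.5, 1.0),
]

# Halve the speed everywhere while keeping the geometric path intact.
slower = scale_speed(traj, 0.5)
```

Keeping speed as a separate channel means shape and speed can be changed independently, which matches the distinction the findings draw between adapting trajectory shape and adapting speed.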

Findings

01. Successfully follows human intent in diverse scenarios

02. Adapts trajectory shape and speed effectively

03. Demonstrates applicability to aerial and legged robots

Abstract

Natural language is one of the most intuitive ways to express human intent. However, translating instructions and commands towards robotic motion generation and deployment in the real world is far from being an easy task. The challenge of combining a robot's inherent low-level geometric and kinodynamic constraints with a human's high-level semantic instructions is traditionally solved using task-specific solutions with little generalizability between hardware platforms, often with the use of static sets of target actions and commands. This work instead proposes a flexible language-based framework that allows a user to modify generic robotic trajectories. Our method leverages pre-trained language models (BERT and CLIP) to encode the user's intent and target objects directly from a free-form text input and scene images, fuses geometrical features generated by a transformer encoder…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Natural Language Processing Techniques

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings