Where to Fetch: Extracting Visual Scene Representation from Large Pre-Trained Models for Robotic Goal Navigation

Yu Li; Dayou Li; Chenkun Zhao; Ruifeng Wang; Ran Song; Wei Zhang

arXiv:2408.10578·cs.RO·August 21, 2024

TL;DR

This paper introduces a visual scene representation derived from large-scale visual language models, enabling robots to interpret natural language instructions and navigate complex environments effectively.

Contribution

The work presents a novel scene representation method that integrates visual language models with large language models for improved robotic goal navigation.

Findings

01. Enables robots to follow diverse natural language instructions.

02. Improves environment understanding for goal-directed navigation.

03. Demonstrates successful complex task completion in experiments.

Abstract

To complete a complex task in which a robot navigates to a goal object and fetches it, the robot needs a good understanding of both the instructions and the surrounding environment. Large pre-trained models have shown the ability to interpret tasks defined via language descriptions. However, previous methods that attempt to integrate large pre-trained models with daily tasks fall short in many robotic goal navigation tasks because they understand the environment poorly. In this work, we present a visual scene representation built with large-scale visual language models, forming a feature representation of the environment that can handle natural language queries. Combined with large language models, this method can parse language instructions into action sequences for a robot to follow, and accomplish goal navigation by querying the scene representation. Experiments demonstrate…
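The query-then-act pipeline described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the scene map, the three-dimensional feature vectors, the text embeddings, and the names `SCENE_MAP`, `TEXT_EMBEDDINGS`, `query_scene`, and `execute` are all hypothetical stand-ins. In a real system the features would come from a visual language model and the plan from a large language model; here both are hand-written so the example stays self-contained.

```python
import math

# Hypothetical scene representation: grid cell -> feature vector.
# Hand-made values standing in for visual language model features.
SCENE_MAP = {
    (0, 0): [0.9, 0.1, 0.0],  # region whose features resemble "mug"
    (2, 1): [0.1, 0.9, 0.0],  # region whose features resemble "sofa"
    (4, 3): [0.0, 0.1, 0.9],  # region whose features resemble "plant"
}

# Hypothetical text embeddings living in the same feature space.
TEXT_EMBEDDINGS = {
    "mug": [1.0, 0.0, 0.0],
    "sofa": [0.0, 1.0, 0.0],
    "plant": [0.0, 0.0, 1.0],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def query_scene(label):
    """Return the map cell whose features best match the text query."""
    query = TEXT_EMBEDDINGS[label]
    return max(SCENE_MAP, key=lambda cell: cosine(SCENE_MAP[cell], query))


def execute(plan):
    """Resolve each navigation step to a map cell; pass other steps through."""
    resolved = []
    for action, target in plan:
        if action == "navigate_to":
            resolved.append((action, query_scene(target)))
        else:
            resolved.append((action, target))
    return resolved


# Hand-written stand-in for a language model's parse of "fetch the mug".
plan = [("navigate_to", "mug"), ("pick", "mug")]
print(execute(plan))
```

The sketch keeps the language-parsing step and the scene query separate, so either could be replaced by a real model without changing the other.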

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Robotic Path Planning Algorithms