Learning Articulated Motion Models from Visual and Lingual Signals

Zhengyang Wu; Mohit Bansal; Matthew R. Walter

arXiv:1511.05526·cs.RO·July 4, 2016

TL;DR

This paper introduces a multimodal learning framework that combines visual and lingual signals to accurately model the kinematics of articulated objects, improving robot manipulation capabilities in human environments.

Contribution

The paper presents a novel probabilistic language model that links natural language descriptions to kinematic structures, improving model inference over vision-only methods.

Findings

01. 36% improvement in model accuracy over the vision-only baseline
02. Successfully infers kinematic structures of complex household objects
03. Demonstrates the effectiveness of multimodal signals in robotic manipulation

Abstract

In order for robots to operate effectively in homes and workplaces, they must be able to manipulate the articulated objects common within environments built for and by humans. Previous work learns kinematic models that prescribe this manipulation from visual demonstrations. Lingual signals, such as natural language descriptions and instructions, offer a complementary means of conveying knowledge of such manipulation models and are suitable to a wide range of interactions (e.g., remote manipulation). In this paper, we present a multimodal learning framework that incorporates both visual and lingual information to estimate the structure and parameters that define kinematic models of articulated objects. The visual signal takes the form of an RGB-D image stream that opportunistically captures object motion in an unprepared scene. Accompanying natural language descriptions of the motion…
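The idea of combining the two signals can be illustrated with a minimal sketch. It is not the paper's actual model: it assumes, purely for illustration, that the visual observations yield a log-likelihood for each candidate joint type and that the language description yields a probability over the same types, and it fuses them with Bayes' rule. The joint-type names, the function `fuse`, and all numbers are hypothetical.

```python
import math

# Hypothetical candidate joint types for one pair of object parts.
MODEL_TYPES = ("rigid", "prismatic", "revolute")


def fuse(visual_log_lik, language_prob):
    """Return a posterior over joint types.

    Assumption (not from the paper): posterior is proportional to
    exp(visual log-likelihood) * language-derived probability.
    """
    # Subtract the max log-likelihood for numerical stability.
    m = max(visual_log_lik[k] for k in MODEL_TYPES)
    unnorm = {
        k: math.exp(visual_log_lik[k] - m) * language_prob[k]
        for k in MODEL_TYPES
    }
    total = sum(unnorm.values())
    return {k: v / total for k, v in unnorm.items()}


# Made-up numbers: vision alone slightly prefers "revolute",
# while a description such as "the drawer slides out" favors "prismatic".
visual_log_lik = {"rigid": -12.0, "prismatic": -10.2, "revolute": -10.0}
language_prob = {"rigid": 0.05, "prismatic": 0.80, "revolute": 0.15}

vision_only_best = max(visual_log_lik, key=visual_log_lik.get)
posterior = fuse(visual_log_lik, language_prob)
fused_best = max(posterior, key=posterior.get)

print("vision only:", vision_only_best)
print("vision + language:", fused_best)
```

In this toy case the language signal resolves an ambiguity that noisy visual tracks leave open, which is the kind of benefit the abstract attributes to the multimodal framework.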

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization