Case Relation Transformer: A Crossmodal Language Generation Model for Fetching Instructions

Motonari Kambara; Komei Sugiura

arXiv:2107.00789·cs.RO·July 5, 2021

Open Access

TL;DR

This paper introduces the Case Relation Transformer (CRT), a crossmodal language generation model that generates fetching instruction sentences from images, with the aim of augmenting training datasets for domestic service robots.

Contribution

The paper presents the CRT, which uses a Transformer to integrate the visual and geometry features of objects in an image and generate an instruction sentence.

Findings

01 CRT outperforms baseline methods in comparison experiments

02 Human evaluation favors CRT-generated instructions

03 Visual and geometry features of objects are integrated by the Transformer

Abstract

There have been many studies in robotics to improve the communication skills of domestic service robots. Most studies, however, have not fully benefited from recent advances in deep neural networks because the training datasets are not large enough. In this paper, our aim is to augment the datasets based on a crossmodal language generation model. We propose the Case Relation Transformer (CRT), which generates a fetching instruction sentence from an image, such as "Move the blue flip-flop to the lower left box." Unlike existing methods, the CRT uses the Transformer to integrate the visual features and geometry features of objects in the image. The CRT can handle the objects because of the Case Relation Block. We conducted comparison experiments and a human evaluation. The experimental results show the CRT outperforms baseline methods.
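The abstract does not spell out how the visual and geometry features are combined, so the following is only a minimal sketch of one common way to do it, not the CRT itself. It assumes each object region comes with a visual feature vector and a normalized bounding box, projects both into a shared space, sums them, and applies single-head scaled dot-product self-attention. The function name `fuse_regions`, the dimensions, and the random projection matrices (standing in for learned layers) are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fuse_regions(visual, boxes, d_model=16, seed=0):
    """Fuse per-region visual and geometry features with self-attention.

    visual: (n, d_v) array, one feature vector per object region.
    boxes:  (n, 4) array of normalized [x1, y1, x2, y2] coordinates.
    Returns the fused tokens (n, d_model) and the attention weights (n, n).
    """
    rng = np.random.default_rng(seed)
    n, d_v = visual.shape

    # Hypothetical random projections standing in for learned linear layers.
    w_vis = rng.standard_normal((d_v, d_model)) / np.sqrt(d_v)
    w_geo = rng.standard_normal((4, d_model)) / 2.0
    tokens = visual @ w_vis + boxes @ w_geo  # (n, d_model)

    # Single-head scaled dot-product self-attention over the region tokens.
    w_q, w_k, w_v = (
        rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        for _ in range(3)
    )
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    weights = softmax(q @ k.T / np.sqrt(d_model), axis=-1)
    return weights @ v, weights


# Toy input: three regions (for example a target object, a destination, and context).
visual = np.random.default_rng(1).standard_normal((3, 8))
boxes = np.array([
    [0.10, 0.20, 0.30, 0.40],
    [0.00, 0.60, 0.40, 1.00],
    [0.00, 0.00, 1.00, 1.00],
])
fused, weights = fuse_regions(visual, boxes)
print(fused.shape, weights.shape)
```

In a full model, the fused region tokens would be consumed by a decoder that emits the instruction sentence token by token; that part is omitted here.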

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Layer Normalization · Byte Pair Encoding · Dropout