Case Relation Transformer: A Crossmodal Language Generation Model for Fetching Instructions
Motonari Kambara, Komei Sugiura

TL;DR
This paper introduces the Case Relation Transformer (CRT), a crossmodal language generation model that generates fetching instruction sentences from images, with the aim of augmenting training datasets for domestic service robots.
Contribution
The paper presents the CRT, which uses a Transformer to integrate the visual features and geometry features of objects in an image for instruction generation.
Findings
CRT outperforms baseline methods in experiments
Human evaluation favors CRT-generated instructions
Effective integration of visual and geometric features
Abstract
There have been many studies in robotics to improve the communication skills of domestic service robots. Most studies, however, have not fully benefited from recent advances in deep neural networks because the training datasets are not large enough. In this paper, our aim is to augment the datasets based on a crossmodal language generation model. We propose the Case Relation Transformer (CRT), which generates a fetching instruction sentence from an image, such as "Move the blue flip-flop to the lower left box." Unlike existing methods, the CRT uses the Transformer to integrate the visual features and geometry features of objects in the image. The CRT can handle the objects because of the Case Relation Block. We conducted comparison experiments and a human evaluation. The experimental results show the CRT outperforms baseline methods.
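The abstract's idea of integrating the visual and geometry features of objects with a Transformer can be illustrated with a minimal sketch. This is not the authors' Case Relation Block: the function names (`fuse_tokens`, `self_attention`), the toy feature values, the bounding-box layout (x, y, width, height), and the use of identity projections in place of learned query/key/value weights are all assumptions made for illustration only.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_tokens(visual, geometry):
    """Concatenate each region's visual vector with its geometry vector
    (hypothetical fusion step; the paper may combine them differently)."""
    return [v + g for v, g in zip(visual, geometry)]

def self_attention(tokens):
    """Single-head scaled dot-product self-attention.
    Assumption: identity projections replace learned Q/K/V weights."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        # Similarity of this region to every region, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        # Each output is a weighted average of all region tokens.
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out

# Toy inputs for three regions: target object, destination, and context.
visual = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]   # made-up 2-d visual features
geometry = [[0.1, 0.2, 0.3, 0.3],               # made-up normalized boxes
            [0.6, 0.7, 0.2, 0.2],
            [0.0, 0.0, 1.0, 1.0]]

fused = fuse_tokens(visual, geometry)
attended = self_attention(fused)
```

Each row of `attended` mixes information from all three regions, which is the general mechanism that lets a Transformer relate a target object to its destination; a real model would add learned projections, multiple heads, and a decoder that emits the instruction sentence.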
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Layer Normalization · Byte Pair Encoding · Dropout
