Understanding Natural Language Instructions for Fetching Daily Objects Using GAN-Based Multimodal Target-Source Classification
Aly Magassouba, Komei Sugiura, Anh Trinh Quoc, Hisashi Kawai

TL;DR
This paper introduces a multimodal classifier for understanding fetching instructions given to domestic service robots. Extended with GANs for data augmentation, it outperforms the previous state-of-the-art method on a standard dataset.
Contribution
The paper presents a novel Multimodal Target-source Classifier Model (MTCM) and extends it with Generative Adversarial Nets (MTCM-GAN) so that data augmentation and classification are performed simultaneously in robotic fetching tasks.
Findings
MTCM outperforms the state-of-the-art method on a standard dataset.
The GAN extension (MTCM-GAN) enables simultaneous data augmentation and classification.
The approach predicts region-wise likelihoods of target and source candidates in the scene.
Abstract
In this paper, we address multimodal language understanding of unconstrained fetching instructions in the context of domestic service robots. A typical fetching instruction such as "Bring me the yellow toy from the white shelf" requires inferring the user's intention, that is, what object (target) to fetch and from where (source). To solve the task, we propose a Multimodal Target-source Classifier Model (MTCM), which predicts the region-wise likelihood of target and source candidates in the scene. Unlike other methods, MTCM can handle region-wise classification based on linguistic and visual features. We evaluated our approach, which outperformed the state-of-the-art method on a standard dataset. In addition, we extended MTCM with Generative Adversarial Nets (MTCM-GAN), enabling simultaneous data augmentation and classification.
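To make the idea of region-wise target and source likelihoods concrete, here is a minimal Python sketch. It is not the authors' architecture: the dot-product scoring, the two query vectors (`target_query`, `source_query`), and the hand-made three-dimensional region features are all assumptions chosen for illustration. It only shows how one instruction can yield two separate probability distributions over the same set of candidate regions.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def region_likelihoods(query, region_features):
    """Score every candidate region against a query vector, then normalise."""
    return softmax([dot(query, feats) for feats in region_features])

# Hypothetical toy features standing in for learned visual embeddings.
regions = {
    "yellow toy": [0.9, 0.1, 0.1],
    "red cup": [0.1, 0.8, 0.0],
    "white shelf": [0.2, 0.1, 0.9],
}
names = list(regions)
features = [regions[n] for n in names]

# Hypothetical linguistic queries for "Bring me the yellow toy from the white shelf".
# A real model would derive these from the instruction text; here they are hand-set.
target_query = [1.0, 0.0, 0.2]
source_query = [0.0, 0.0, 1.0]

target_probs = region_likelihoods(target_query, features)
source_probs = region_likelihoods(source_query, features)

target = names[target_probs.index(max(target_probs))]
source = names[source_probs.index(max(source_probs))]
print("target:", target)
print("source:", source)
```

With these toy values the highest target likelihood falls on the toy region and the highest source likelihood on the shelf region; in a trained model both distributions would come from learned linguistic and visual features rather than hand-set vectors.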
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
