Multimodal Diffusion Segmentation Model for Object Segmentation from Manipulation Instructions

Yui Iioka; Yu Yoshida; Yuiga Wada; Shumpei Hatanaka; Komei Sugiura

arXiv:2307.08597·cs.CV·July 18, 2023

TL;DR

This paper introduces the Multimodal Diffusion Segmentation Model (MDSM) that understands complex natural language instructions and generates precise segmentation masks for objects in indoor scenes, outperforming existing methods.

Contribution

The paper presents a novel two-stage diffusion-based model with crossmodal feature extraction for language-guided object segmentation in complex indoor environments.

Findings

01

MDSM surpassed the baseline method by +10.13 in mean IoU.

02

The model effectively handles complex referring expressions.

03

A new dataset was created for evaluation.
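
The mean IoU figure cited in the findings above is a mask-overlap metric. The following is a minimal sketch of how such a score could be computed for binary masks, not the authors' evaluation code. It assumes masks are given as nested lists of 0/1 values, that IoU is averaged per sample, and that two empty masks count as a perfect match; the function names `iou` and `mean_iou` are hypothetical.

```python
from typing import Sequence

# A binary mask: rows of 0/1 pixel labels (illustrative representation).
Mask = Sequence[Sequence[int]]


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two binary masks of the same shape."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumption: if both masks are empty, treat them as a perfect match.
    return intersection / union if union else 1.0


def mean_iou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Average the per-sample IoU over paired predictions and targets.

    Raises ZeroDivisionError if no samples are given.
    """
    scores = [iou(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    preds = [[[1, 1], [0, 0]], [[0, 1], [0, 1]]]
    targets = [[[1, 0], [0, 0]], [[0, 1], [0, 1]]]
    print(mean_iou(preds, targets))
```

In this toy example the first pair overlaps on one of two labeled pixels and the second pair matches exactly, so the per-sample scores differ. Other averaging conventions exist (for example, pooling intersections and unions over the whole dataset), and the listing does not say which one the paper uses.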

Abstract

In this study, we aim to develop a model that comprehends a natural language instruction (e.g., "Go to the living room and get the nearest pillow to the radio art on the wall") and generates a segmentation mask for the target everyday object. The task is challenging because it requires (1) the understanding of the referring expressions for multiple objects in the instruction, (2) the prediction of the target phrase of the sentence among the multiple phrases, and (3) the generation of pixel-wise segmentation masks rather than bounding boxes. Studies have been conducted on language-based segmentation methods; however, they sometimes mask irrelevant regions for complex sentences. In this paper, we propose the Multimodal Diffusion Segmentation Model (MDSM), which generates a mask in the first stage and refines it in the second stage. We introduce a crossmodal parallel feature extraction…

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Speech and dialogue systems

Methods: Diffusion