LISA: Reasoning Segmentation via Large Language Model
Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, Jiaya Jia

TL;DR
LISA introduces a novel reasoning segmentation task enabling large language models to generate segmentation masks from complex, implicit queries, supported by a new benchmark and a specialized model architecture.
Contribution
The paper proposes a new reasoning segmentation task, establishes a benchmark dataset, and develops LISA, a multimodal LLM capable of producing segmentation masks with reasoning abilities.
Findings
LISA can handle complex reasoning and world knowledge in segmentation.
LISA demonstrates strong zero-shot performance on reasoning tasks.
Fine-tuning with minimal data further improves segmentation accuracy.
Abstract
Although perception systems have made remarkable advancements in recent years, they still rely on explicit human instruction or pre-defined categories to identify the target objects before executing visual recognition tasks. Such systems cannot actively reason and comprehend implicit user intention. In this work, we propose a new segmentation task -- reasoning segmentation. The task is designed to output a segmentation mask given a complex and implicit query text. Furthermore, we establish a benchmark comprising over one thousand image-instruction-mask data samples, incorporating intricate reasoning and world knowledge for evaluation purposes. Finally, we present LISA: large Language Instructed Segmentation Assistant, which inherits the language generation capabilities of multimodal Large Language Models (LLMs) while also possessing the ability to produce segmentation masks. We expand…
Peer Reviews
Decision: ICLR 2024 Conference Withdrawn Submission
The paper proposes an interesting approach to generating segmentations from language inputs. It uses LLMs and multimodal LLMs to interpret input sentences and produce a segmentation embedding as output. The experiments evaluate the model on multiple segmentation tasks and demonstrate improvements.
The work uses pre-trained LLMs and MLLMs as an initial stage, so the proposed pipeline draws on more external learned knowledge. This makes the comparison unfair to methods that do not use pre-trained foundation models. Only one token, "SEG", is designed for the task, so how can it be used for multiple segmentation masks? The reasoning question can already be understood by foundation models, so the model's reasoning capacity does not actually come from the proposed components.
This paper proposes an interesting task, and the proposed method seems effective and promising. Using a pretrained vision expert seems to be a clever way of giving the LLM vision ability.
I have the following concerns about the paper: 1. I wonder whether the model can perform instance segmentation. Can it output multiple masks in one answer? For example, if there are two men, can I obtain an answer like: the mask for the first man <seg>, and the mask for the second man <seg>? 2. I wonder how the model performs on text-generation tasks. Does the model preserve its original ability to hold a conversation? I hope the authors can verify this experimentally. I will consider up
1. The new task Reasoning Segmentation (ReasonSeg) seems like a natural progression from two similar tasks: Referring Expression Segmentation (RES) and Open-Vocabulary Segmentation (OVSeg). ReasonSeg is more challenging, less restrictive, and more flexible in real-world applications. 2. The proposed pipeline LISA is a straightforward and effective solution for adapting existing models and datasets to the challenging ReasonSeg task with reasonable training requirements. 3. This paper yiel
1. LISA has limited technical contributions in its design. 2. Although the new task ReasonSeg is well motivated, (which requires two key capabilities: 1. long text understanding and 2. segmentation), LISA's capabilities are not fully motivated and evaluated. As described by the author: "LISA can handle various scenarios, including 1) complex reasoning; 2) world knowledge; 3) explanatory answers; and 4) multi-turn conversations.", however, this paper mainly evaluates LISA's capability in segmenta
- A novel and interesting task setting: the reasoning segmentation task. - The authors use a special token <SEG> to bridge the LLM and SAM, and use the LoRA technique to fine-tune the model. In this way, LISA combines the LLM's reasoning ability with SAM's segmentation ability. - Using LoRA helps the model achieve good performance with a limited training dataset and limited computational resources, which makes the approach accessible to most labs in the community. - Extensive experiments and good p
- More experiments are needed to make the claim clear: the authors mention that "239 reasoning segmentation image-instruction pairs results in further performance enhancement", which is impressive. However, where does this number come from? Would performance improve further with more data samples? - A simple, direct comparison is needed: how about just using Grounding-SAM, the combination of Grounding-DINO with SAM? What would LISA's strengths be? - Some writing is unclear, for e
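Several of the reviews above refer to the special <SEG> token that links the LLM to the segmentation decoder. The following is a minimal sketch of that idea in plain Python, not the authors' implementation: it assumes the LLM emits one hidden-state vector per output token, picks out the vectors at <SEG> positions, and maps each through a linear projection to form a prompt embedding for a mask decoder. The token id `SEG_TOKEN_ID`, the function names, and the projection step are all hypothetical and chosen only for illustration.

```python
from typing import List

Vector = List[float]

# Hypothetical vocabulary id for the <SEG> token (an assumption, not from the paper).
SEG_TOKEN_ID = 32000


def seg_hidden_states(token_ids: List[int],
                      hidden_states: List[Vector],
                      seg_id: int = SEG_TOKEN_ID) -> List[Vector]:
    """Return the hidden state at every position where the output token is <SEG>."""
    if len(token_ids) != len(hidden_states):
        raise ValueError("token_ids and hidden_states must have the same length")
    return [h for t, h in zip(token_ids, hidden_states) if t == seg_id]


def project(vec: Vector, weight: List[Vector]) -> Vector:
    """Apply a bias-free linear map; weight has shape (out_dim, in_dim)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def seg_prompt_embeddings(token_ids: List[int],
                          hidden_states: List[Vector],
                          weight: List[Vector]) -> List[Vector]:
    """One projected embedding per <SEG> token, to be passed to a mask decoder."""
    return [project(h, weight) for h in seg_hidden_states(token_ids, hidden_states)]
```

With toy inputs such as `token_ids = [1, 5, 32000, 2]` and two-dimensional hidden states, `seg_prompt_embeddings` returns a single projected vector for the one <SEG> position. The sketch returns one embedding per <SEG> occurrence purely as a design choice; whether the actual model can produce several masks in one answer is the open question the reviewers raise, and nothing here settles it.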
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
