LaSagnA: Language-based Segmentation Assistant for Complex Queries

Cong Wei; Haoxian Tan; Yujie Zhong; Yujiu Yang; Lin Ma

arXiv:2404.08506 · cs.CV · April 15, 2024 · 1 citation


TL;DR

LaSagnA enables vision-language models to handle complex queries that involve multiple targets or objects absent from the image. It does this through a new query format and three supporting strategies.

Contribution

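The paper proposes a general sequence format for complex queries. It also adds a semantic segmentation task to the training pipeline to supply the multi-target and absent-object queries that this format requires. Together these improve the ability of vLLMs to process intricate visual queries.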

Findings

01. Achieves results comparable to traditional methods on segmentation datasets.

02. Outperforms existing vLLMs on reasoning and referring segmentation tasks.

03. Demonstrates effectiveness in handling complex, multi-target queries.

Abstract

Recent advancements have empowered Large Language Models for Vision (vLLMs) to generate detailed perceptual outputs, including bounding boxes and masks. Nonetheless, two constraints restrict the further application of these vLLMs: the inability to handle multiple targets per query and the failure to identify the absence of queried objects in the image. In this study, we identify the insufficient complexity of training queries as the main cause of these problems. Consequently, we define a general sequence format for complex queries. We then incorporate a semantic segmentation task into the current pipeline to meet the training-data requirements of this format. Furthermore, we present three novel strategies to effectively handle the challenges arising from the direct integration of the proposed format. The effectiveness of our model in processing complex queries is…
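The abstract describes two ideas: a sequence format for queries with several targets, and semantic segmentation data used to train it. A semantic segmentation label set can supply such queries naturally, because it lists which classes are present in an image, and any other class in the dataset vocabulary is a ready-made absent target.

The following is a minimal sketch of that idea under explicit assumptions. The prompt template "Please segment: ...", the `[SEG]` mask token, the `absent` marker, and the helper names `build_complex_query` and `parse_answer` are all hypothetical. They are not taken from the paper or from the congvvc/lasagna repository.

```python
"""Hypothetical sketch: build a multi-target query with absent classes from a
semantic-segmentation label set. The prompt template, the "[SEG]" token and
the "absent" marker are illustrative assumptions, not LaSagnA's exact format."""
import random
from typing import Dict, List, Sequence, Tuple

SEG_TOKEN = "[SEG]"  # hypothetical token whose hidden state would be decoded into a mask
ABSENT = "absent"    # hypothetical marker for a queried class that is not in the image


def build_complex_query(
    present: Sequence[str],
    vocabulary: Sequence[str],
    num_negatives: int,
    rng: random.Random,
) -> Tuple[str, str, List[str]]:
    """Mix the classes present in an image with sampled absent classes.

    Returns (query, target_answer, queried_classes). Class names must not
    contain ", " because that string separates items in the sequence.
    """
    present_set = set(present)
    candidates = [c for c in vocabulary if c not in present_set]
    negatives = rng.sample(candidates, min(num_negatives, len(candidates)))
    classes = list(present) + negatives
    # Shuffle so position in the sequence does not reveal presence or absence.
    rng.shuffle(classes)
    query = "Please segment: " + ", ".join(classes) + "."
    answer = ", ".join(
        f"{c} {SEG_TOKEN}" if c in present_set else f"{c} {ABSENT}" for c in classes
    )
    return query, answer, classes


def parse_answer(answer: str) -> Dict[str, bool]:
    """Map each class in a target answer to True (mask token) or False (absent)."""
    result: Dict[str, bool] = {}
    for item in answer.split(", "):
        name, _, tag = item.rpartition(" ")
        result[name] = tag == SEG_TOKEN
    return result


if __name__ == "__main__":
    rng = random.Random(0)
    query, answer, classes = build_complex_query(
        present=["person", "dog"],
        vocabulary=["person", "dog", "cat", "car", "dining table"],
        num_negatives=2,
        rng=rng,
    )
    print(query)
    print(answer)
    print(parse_answer(answer))
```

The query mixes real and absent classes in random order. This means a model trained on such pairs must decide, per class, whether to emit a mask token or report absence. It cannot simply segment everything it is asked about. Decoding the mask tokens into actual masks is outside the scope of this sketch.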


Code & Models

Repositories

congvvc/lasagna
PyTorch · Official


Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques