See, Say, and Segment: Teaching LMMs to Overcome False Premises

Tsung-Han Wu; Giscard Biamby; David Chan; Lisa Dunlap; Ritwik Gupta; Xudong Wang; Joseph E. Gonzalez; Trevor Darrell

arXiv:2312.08366·cs.CV·December 14, 2023·1 cites

Open Access

TL;DR

This paper introduces a cascading and joint training approach that lets Large Multimodal Models (LMMs) detect false-premise queries (queries that refer to something not present in the image), respond with helpful natural language feedback, and still segment objects that are present, addressing a key limitation of existing models.

Contribution

The authors propose a cascading and joint training method for LMMs that prevents catastrophic forgetting and enhances false premise detection and correction capabilities.

Findings

01

Detects false premises up to 55% better than existing methods.

02

Achieves over 31% relative cIOU improvement in false premise scenarios.

03

Provides helpful natural language feedback in 67% of cases.
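The cIOU figure above can be made concrete with a small sketch. It assumes cIOU means cumulative IoU as commonly used in referring segmentation (total intersection divided by total union over all samples), and that "relative improvement" means (new - baseline) / baseline; the summary does not spell out either definition. The function names, the toy masks, and the empty-union convention are illustrative assumptions, not the authors' evaluation code.

```python
def cumulative_iou(pred_masks, gt_masks):
    """Total intersection over total union across all samples.

    Masks are nested lists of 0/1 values (one 2D grid per sample).
    Assumed definition; the paper's exact evaluation may differ.
    """
    total_inter = 0
    total_union = 0
    for pred, gt in zip(pred_masks, gt_masks):
        for pred_row, gt_row in zip(pred, gt):
            for p, g in zip(pred_row, gt_row):
                total_inter += 1 if (p and g) else 0
                total_union += 1 if (p or g) else 0
    # Assumed convention: if nothing is predicted and nothing is labeled,
    # treat the result as a perfect score.
    return total_inter / total_union if total_union else 1.0


def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, e.g. 0.31 for a 31% gain."""
    return (new - baseline) / baseline


# Sample 1: the queried object exists. Sample 2: a false-premise query,
# so the ground-truth mask is empty.
gt_masks = [[[1, 0], [0, 0]], [[0, 0], [0, 0]]]

# A model that hallucinates a mask for the false-premise query.
hallucinating = [[[1, 1], [0, 0]], [[1, 1], [0, 0]]]
# A model that correctly outputs an empty mask for that query.
abstaining = [[[1, 1], [0, 0]], [[0, 0], [0, 0]]]

baseline = cumulative_iou(hallucinating, gt_masks)
improved = cumulative_iou(abstaining, gt_masks)
gain = relative_improvement(improved, baseline)
```

Under this definition, a mask predicted for an absent object adds pixels to the union but none to the intersection, so declining to segment on a false premise raises the score even when the remaining predictions are unchanged.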

Abstract

Current open-source Large Multimodal Models (LMMs) excel at tasks such as open-vocabulary language grounding and segmentation but can suffer under false premises when queries imply the existence of something that is not actually present in the image. We observe that existing methods that fine-tune an LMM to segment images significantly degrade their ability to reliably determine ("see") if an object is present and to interact naturally with humans ("say"), a form of catastrophic forgetting. In this work, we propose a cascading and joint training approach for LMMs to solve this task, avoiding catastrophic forgetting of previous skills. Our resulting model can "see" by detecting whether objects are present in an image, "say" by telling the user if they are not, proposing alternative queries or correcting semantic errors in the query, and finally "segment" by outputting the mask of the…

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling