OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

Tao Zhang; Xiangtai Li; Hao Fei; Haobo Yuan; Shengqiong Wu; Shunping Ji; Chen Change Loy; Shuicheng Yan

arXiv:2406.19389 · cs.CV · October 2, 2024 · 5 cites

TL;DR

OMG-LLaVA is a unified framework that combines pixel-level image understanding with reasoning and language capabilities, enabling flexible multimodal interactions and surpassing specialized methods on multiple benchmarks.

Contribution

It introduces a novel end-to-end model integrating universal segmentation with large language models for comprehensive visual reasoning and understanding.

Findings

1. Achieves image-level, object-level, and pixel-level reasoning in a single model.

2. Outperforms specialized methods on multiple benchmarks.

3. Supports flexible user interaction through visual and text prompts.

Abstract

Current universal segmentation methods demonstrate strong capabilities in pixel-level image and video understanding. However, they lack reasoning abilities and cannot be controlled via text instructions. In contrast, large vision-language multimodal models exhibit powerful vision-based conversation and reasoning capabilities but lack pixel-level understanding and have difficulty accepting visual prompts for flexible user interaction. This paper proposes OMG-LLaVA, a new and elegant framework combining powerful pixel-level vision understanding with reasoning abilities. It can accept various visual and text prompts for flexible user interaction. Specifically, we use a universal segmentation method as the visual encoder, integrating image information, perception priors, and visual prompts into visual tokens provided to the LLM. The LLM is responsible for understanding the user's text…
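
The token flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy dimensions, and the single linear projection are all assumptions made for illustration. It only shows the general idea of merging image features, perception priors, and visual prompts into one sequence of visual tokens that is mapped to the LLM's embedding width.

```python
from typing import List

Vector = List[float]


def project(token: Vector, weight: List[Vector]) -> Vector:
    """Map one token to the (hypothetical) LLM embedding width with a bias-free linear layer."""
    return [sum(w * x for w, x in zip(row, token)) for row in weight]


def build_visual_tokens(
    image_tokens: List[Vector],
    object_tokens: List[Vector],
    prompt_tokens: List[Vector],
    weight: List[Vector],
) -> List[Vector]:
    """Concatenate the three token groups, then project each token into the LLM space.

    The grouping and ordering here are assumptions, not taken from the paper.
    """
    merged = image_tokens + object_tokens + prompt_tokens
    return [project(token, weight) for token in merged]


# Toy example: 2-dim encoder features, 3-dim hypothetical LLM embeddings.
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_tokens = [[0.5, 0.5], [1.0, 0.0]]  # stand-ins for pixel-level image features
object_tokens = [[0.0, 2.0]]             # stand-in for a perception prior
prompt_tokens = [[1.0, 1.0]]             # stand-in for an encoded visual prompt

visual_tokens = build_visual_tokens(image_tokens, object_tokens, prompt_tokens, weight)
print(len(visual_tokens), "visual tokens of width", len(visual_tokens[0]))
```

In a real system the projected tokens would be interleaved with the embedded text instruction before being passed to the LLM; that step is omitted here because the abstract excerpt is truncated before describing it.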

Code & Models

Repositories

lxtgh/omg-seg (PyTorch)

Videos

OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding (SlidesLive)

Taxonomy

Topics: Cell Image Analysis Techniques · Brain Tumor Detection and Classification · Explainable Artificial Intelligence (XAI)