VCoder: Versatile Vision Encoders for Multimodal Large Language Models

Jitesh Jain; Jianwei Yang; Humphrey Shi

arXiv:2312.14233 · cs.CV · December 25, 2023 · 1 citation

TL;DR

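VCoder improves the perception abilities of multimodal large language models (MLLMs) by adding versatile vision encoders that take perception modalities, such as segmentation or depth maps, as extra input. Trained on a new dataset, it improves performance on object perception tasks such as identifying and counting the entities in an image, a step toward more accurate visual reasoning.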

Contribution

The paper introduces VCoder, a versatile perception module for MLLMs; builds the COST dataset for training and evaluating object perception; and proposes new metrics for measuring object perception in MLLMs.

Findings

01. VCoder outperforms existing MLLMs, including GPT-4V, on object perception tasks.

02. The COST dataset supports both training and evaluating MLLMs on object perception.

03. The newly introduced metrics measure object-level perception in MLLMs.

Abstract

Humans possess the remarkable skill of Visual Perception, the ability to see and understand the seen, helping them make sense of the visual world and, in turn, reason. Multimodal Large Language Models (MLLM) have recently achieved impressive performance on vision-language tasks ranging from visual question-answering and image captioning to visual reasoning and image generation. However, when prompted to identify or count (perceive) the entities in a given image, existing MLLM systems fail. Working towards developing an accurate MLLM system for perception and reasoning, we propose using Versatile vision enCoders (VCoder) as perception eyes for Multimodal LLMs. We feed the VCoder with perception modalities such as segmentation or depth maps, improving the MLLM's perception abilities. Secondly, we leverage the images from COCO and outputs from off-the-shelf vision perception models to…

Code & Models

Repositories

shi-labs/vcoder (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling