VCoder: Versatile Vision Encoders for Multimodal Large Language Models
Jitesh Jain, Jianwei Yang, Humphrey Shi

TL;DR
VCoder enhances the perception capabilities of multimodal large language models by integrating versatile vision encoders trained on a new dataset, significantly improving performance on object perception tasks and opening avenues for advanced visual reasoning.
Contribution
Introduces VCoder as a versatile perception module, creates the COST dataset for object perception, and develops new metrics to evaluate perception in MLLMs.
Findings
VCoder outperforms existing MLLMs like GPT-4V in object perception tasks.
The COST dataset enables robust training and evaluation of perception abilities.
The newly introduced metrics effectively measure object-level perception in MLLMs.
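To make "object-level perception" concrete, the sketch below scores a model's object counts against ground truth. It is an illustration only: the function names `count_agreement` and `hallucination_rate` and both formulas are assumptions made for this example, not the metrics defined in the paper, which this summary does not spell out.

```python
from collections import Counter

# Hypothetical scores for illustration; NOT the paper's metric definitions.

def count_agreement(gt_counts, pred_counts):
    """Fraction of ground-truth object instances that the prediction accounts for."""
    gt, pred = Counter(gt_counts), Counter(pred_counts)
    matched = sum((gt & pred).values())  # per-category minimum of the two counts
    total = sum(gt.values())
    return matched / total if total else 1.0

def hallucination_rate(gt_counts, pred_counts):
    """Fraction of predicted object instances with no ground-truth support."""
    gt, pred = Counter(gt_counts), Counter(pred_counts)
    extra = sum((pred - gt).values())  # predicted instances beyond the ground truth
    total = sum(pred.values())
    return extra / total if total else 0.0

# Toy example: ground truth vs. counts parsed from a model's answer.
ground_truth = {"person": 2, "dog": 1, "car": 3}
predicted = {"person": 2, "dog": 2, "cat": 1, "car": 1}

print(count_agreement(ground_truth, predicted))
print(hallucination_rate(ground_truth, predicted))
```

In the toy example the model misses two of the three cars and invents one dog and one cat, so the two scores capture under-counting and hallucination separately.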
Abstract
Humans possess the remarkable skill of Visual Perception, the ability to see and understand the seen, helping them make sense of the visual world and, in turn, reason. Multimodal Large Language Models (MLLM) have recently achieved impressive performance on vision-language tasks ranging from visual question-answering and image captioning to visual reasoning and image generation. However, when prompted to identify or count (perceive) the entities in a given image, existing MLLM systems fail. Working towards developing an accurate MLLM system for perception and reasoning, we propose using Versatile vision enCoders (VCoder) as perception eyes for Multimodal LLMs. We feed the VCoder with perception modalities such as segmentation or depth maps, improving the MLLM's perception abilities. Secondly, we leverage the images from COCO and outputs from off-the-shelf vision perception models to…
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
