MegaCOIN: Enhancing Medium-Grained Color Perception for Vision-Language Models

Ming-Chang Chiu; Shicheng Wen; Pin-Yu Chen; Xuezhe Ma

arXiv:2412.03927 · cs.CV · December 6, 2024

TL;DR

MegaCOIN introduces a comprehensive dataset for evaluating and improving vision-language models' ability to perceive subtle color and environmental details, advancing their contextual understanding and domain generalization.

Contribution

The paper presents MegaCOIN, a new high-quality dataset with annotations for color and environment, and demonstrates its utility in benchmarking and enhancing vision-language models.

Findings

01 VLMs have limited color recognition capabilities.

02 Fine-tuning with MegaCOIN improves model performance.

03 Open-source models can outperform GPT-4o after fine-tuning.

Abstract

In vision-language models (VLMs), the ability to perceive and interpret color and physical environment is crucial for achieving contextually accurate understanding and interaction. However, despite advances in multimodal modeling, there remains a significant lack of specialized datasets that rigorously evaluate a model's capacity to discern subtle color variations and spatial context, critical elements for situational comprehension and reliable deployment across real-world applications. Toward that goal, we curate MegaCOIN, a high-quality, human-labeled dataset based on real images with various contextual attributes. MegaCOIN consists of two parts: MegaCOIN-Instruct, which serves as a supervised fine-tuning (SFT) dataset for VLMs; and MegaCOIN-Bench, an annotated test set that can be used as a stand-alone QA dataset. MegaCOIN provides three annotated features for 220,000 real…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques

Methods: Sparse Evolutionary Training