MegaCOIN: Enhancing Medium-Grained Color Perception for Vision-Language Models
Ming-Chang Chiu, Shicheng Wen, Pin-Yu Chen, Xuezhe Ma

TL;DR
MegaCOIN introduces a comprehensive dataset for evaluating and improving vision-language models' ability to perceive subtle color and environmental details, advancing their contextual understanding and domain generalization.
Contribution
The paper presents MegaCOIN, a new high-quality dataset with annotations for color and environment, and demonstrates its utility in benchmarking and enhancing vision-language models.
Findings
VLMs have limited color recognition capabilities.
Fine-tuning with MegaCOIN improves model performance.
Open-source models can outperform GPT-4o after fine-tuning.
Abstract
In vision-language models (VLMs), the ability to perceive and interpret color and physical environment is crucial for achieving contextually accurate understanding and interaction. However, despite advances in multimodal modeling, there remains a significant lack of specialized datasets that rigorously evaluate a model's capacity to discern subtle color variations and spatial context, critical elements for situational comprehension and reliable deployment across real-world applications. Toward that goal, we curate MegaCOIN, a high-quality, human-labeled dataset based on real images with various contextual attributes. MegaCOIN consists of two parts: MegaCOIN-Instruct, which serves as a supervised fine-tuning (SFT) dataset for VLMs; and MegaCOIN-Bench, an annotated test set that can be used as a stand-alone QA dataset. MegaCOIN provides three annotated features for 220,000 real…
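To make the idea of a stand-alone QA test set concrete, here is a minimal Python sketch of how predictions might be scored against color labels. It is illustrative only: the record layout (`ColorQAItem`), the field names, and the exact-match metric are assumptions made for this example, not the dataset's actual schema or the paper's evaluation protocol.

```python
from dataclasses import dataclass


@dataclass
class ColorQAItem:
    # Hypothetical record layout; the real MegaCOIN-Bench schema may differ.
    image_id: str
    question: str
    answer: str  # gold label, e.g. a color name


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so "Dark  Red" matches "dark red".
    return " ".join(text.lower().split())


def exact_match_accuracy(items, predictions):
    # `predictions` maps image_id -> the model's answer string.
    # A missing prediction counts as wrong; an empty item list scores 0.0.
    if not items:
        return 0.0
    correct = sum(
        normalize(predictions.get(item.image_id, "")) == normalize(item.answer)
        for item in items
    )
    return correct / len(items)
```

Exact match is deliberately strict: under this metric a prediction of "blue" does not match a gold label of "navy blue", which is the kind of distinction a medium-grained color benchmark is meant to probe. A real evaluation might instead use multiple-choice options or a more lenient matching rule.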
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
Methods: Sparse Evolutionary Training
