FullAnno: A Data Engine for Enhancing Image Comprehension of MLLMs
Jing Hao, Yuxiang Zhao, Song Chen, Yanpeng Sun, Qiang Chen, Gang Zhang, Kun Yao, Errui Ding, Jingdong Wang

TL;DR
FullAnno is a scalable data engine that generates high-quality, detailed image annotations to improve multimodal large language models' understanding and reasoning in vision-language tasks.
Contribution
The paper introduces FullAnno, a novel cascade annotation system that re-annotates datasets with richer, more detailed image labels to enhance MLLMs' performance.
Findings
Re-annotated COCO and Visual Genome datasets with increased object and caption details.
Enhanced LLaVA-v1.5 performance on multiple benchmarks using FullAnno data.
Tripled the number of object annotations and made captions 15 times longer than in the original datasets.
Abstract
Multimodal Large Language Models (MLLMs) have shown promise in a broad range of vision-language tasks with their strong reasoning and generalization capabilities. However, they heavily depend on high-quality data in the Supervised Fine-Tuning (SFT) phase. The existing approaches aim to curate high-quality data via GPT-4V, but they are not scalable due to the commercial nature of GPT-4V and the simplicity of the prompts used to instruct the model. To this end, we devised the FullAnno system, which is a data engine that can generate large-scale, high-quality, and fine-grained image annotations consisting of the category and position of objects, region descriptions, text information, as well as image dense captions. This engine is characterized by its cascade annotation process, which involves multiple expert models and employs rich prompts to instruct LLMs in generating dense image…
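The annotation types listed in the abstract (object category and position, region descriptions, text information, dense captions) can be pictured as a single per-image record that is serialized into a prompt for an LLM. The Python sketch below is only an illustration under stated assumptions: the class names, field names, box format, and prompt wording are hypothetical and are not taken from the paper.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ObjectAnnotation:
    """One detected object (hypothetical schema, not the paper's)."""
    category: str
    # Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
    box: tuple[float, float, float, float]
    description: str = ""  # optional region description


@dataclass
class ImageAnnotation:
    """All annotations collected for one image (hypothetical schema)."""
    image_id: str
    objects: list[ObjectAnnotation] = field(default_factory=list)
    texts: list[str] = field(default_factory=list)  # text read from the image
    dense_caption: str = ""  # to be filled in by the captioning LLM


def build_caption_prompt(ann: ImageAnnotation) -> str:
    """Serialize the collected annotations into a prompt (assumed wording)."""
    lines = ["Describe the image in detail using the following annotations."]
    for obj in ann.objects:
        x0, y0, x1, y1 = obj.box
        line = f"- {obj.category} at [{x0:g}, {y0:g}, {x1:g}, {y1:g}]"
        if obj.description:
            line += f": {obj.description}"
        lines.append(line)
    for text in ann.texts:
        lines.append(f'- text: "{text}"')
    return "\n".join(lines)
```

In a cascade of this kind, each expert model would fill in one part of the record before the prompt is built, so the LLM sees every earlier output at once.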
Taxonomy
Topics: Multimodal Machine Learning Applications
