Can Vision-Language Models Replace Human Annotators: A Case Study with CelebA Dataset
Haoming Lu, Feifei Zhong

TL;DR
This study demonstrates that Vision-Language Models can achieve high-quality image annotations on CelebA at a fraction of the cost of manual annotation, showing promise as a scalable alternative.
Contribution
The paper provides empirical evidence that VLMs can replace human annotators for certain annotation tasks, with improved consistency and significant cost savings.
Findings
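LLaVA-NeXT annotations on 1000 CelebA images agree with the original human labels in 79.5% of cases
Adding re-annotations of the disagreed cases to a majority vote raises agreement to 89.1%
AI annotation costs less than 1% of what manual annotation costs for CelebA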
Abstract
This study evaluates the capability of Vision-Language Models (VLMs) in image data annotation by comparing their performance on the CelebA dataset in terms of quality and cost-effectiveness against manual annotation. Annotations from the state-of-the-art LLaVA-NeXT model on 1000 CelebA images are in 79.5% agreement with the original human annotations. Incorporating re-annotations of disagreed cases into a majority vote boosts AI annotation consistency to 89.1% and even higher for more objective labels. Cost assessments demonstrate that AI annotation significantly reduces expenditures compared to traditional manual methods -- representing less than 1% of the costs for manual annotation in the CelebA dataset. These findings support the potential of VLMs as a viable, cost-effective alternative for specific annotation tasks, reducing both financial burden and ethical concerns associated…
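The agreement figures above can be illustrated with a short sketch. It is not the authors' pipeline: the abstract does not say how many re-annotations were collected or how ties were broken, so `extra_rounds`, the tie-breaking rule, and the `reannotate` callback are assumptions made only for illustration. The sketch computes the fraction of matching labels, then re-annotates only the disagreed items and takes a majority vote over the VLM outputs.

```python
from collections import Counter
from typing import Callable, Hashable, List, Sequence


def agreement_rate(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Fraction of positions where the two label sequences match."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def majority_vote(votes: Sequence[Hashable]) -> Hashable:
    """Most common vote; ties go to the earliest vote (an assumed rule)."""
    counts = Counter(votes)
    top = max(counts.values())
    for v in votes:
        if counts[v] == top:
            return v
    raise ValueError("votes must be non-empty")


def resolve_disagreements(
    human: Sequence[Hashable],
    vlm: Sequence[Hashable],
    reannotate: Callable[[int], Hashable],  # hypothetical: re-queries the VLM for item i
    extra_rounds: int = 2,                  # assumed number of re-annotations
) -> List[Hashable]:
    """Re-annotate only the items where the VLM and human labels differ."""
    resolved = []
    for i, (h, v) in enumerate(zip(human, vlm)):
        if h == v:
            resolved.append(v)
        else:
            votes = [v] + [reannotate(i) for _ in range(extra_rounds)]
            resolved.append(majority_vote(votes))
    return resolved


# Toy data, not taken from the paper.
human_labels = [1, 0, 1, 1, 0]
vlm_labels = [1, 1, 1, 0, 0]
# Stub re-annotator: the model repeats its answer on item 1 and flips on item 3.
stub = lambda i: 1
final_labels = resolve_disagreements(human_labels, vlm_labels, stub)
print(agreement_rate(human_labels, vlm_labels))    # agreement before re-annotation
print(agreement_rate(human_labels, final_labels))  # agreement after majority vote
```

Restricting re-annotation to the disagreed items keeps the extra inference cost proportional to the disagreement rate rather than to the size of the dataset.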
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Topic Modeling
