Are Multimodal Large Language Models Good Annotators for Image Tagging?

Ming-Kun Xie; Jia-Hao Xiao; Zhiqiang Kou; Zhongnian Li; Gang Niu; Masashi Sugiyama

arXiv:2602.20972 · cs.CV · February 25, 2026

TL;DR

This paper evaluates the potential of Multimodal Large Language Models (MLLMs) as automated image taggers, demonstrating significant cost savings and promising annotation quality, and introduces TagLLM, a framework to improve MLLM annotation accuracy for downstream tasks.

Contribution

The paper analyzes MLLM annotation capabilities, quantifies cost and quality gaps compared to humans, and proposes TagLLM, a novel framework to enhance MLLM-based image tagging accuracy.

Findings

01

Under a conservative estimate, MLLMs can reduce annotation cost to as low as one-thousandth (0.1%) of the human cost, with GPU usage as the main expense.

02

MLLM annotation quality reaches about 50% to 80% of human performance.

03

TagLLM narrows the annotation gap, improving downstream task performance by 60-80%.

Abstract

Image tagging, a fundamental vision task, traditionally relies on human-annotated datasets to train multi-label classifiers, which incurs significant labor and costs. While Multimodal Large Language Models (MLLMs) offer promising potential to automate annotation, their capability to replace human annotators remains underexplored. This paper aims to analyze the gap between MLLM-generated and human annotations and to propose an effective solution that enables MLLM-based annotation to replace manual labeling. Our analysis of MLLM annotations reveals that, under a conservative estimate, MLLMs can reduce annotation cost to as low as one-thousandth of the human cost, mainly accounting for GPU usage, which is nearly negligible compared to manual efforts. Their annotation quality reaches about 50% to 80% of human performance, while achieving over 90% performance on downstream training…

Taxonomy

Topics: Multimodal Machine Learning Applications · Text and Document Classification Technologies · Domain Adaptation and Few-Shot Learning