TagGPT: Large Language Models are Zero-shot Multimodal Taggers
Chen Li, Yixiao Ge, Jiayong Mao, Dian Li, Ying Shan

TL;DR
TagGPT leverages large language models with prompt engineering to perform zero-shot multimodal tag extraction and tagging, improving multimedia content distribution without task-specific training.
Contribution
This work introduces TagGPT, a modular zero-shot tagging system using LLMs and sentence embeddings, capable of handling various modalities and outperforming existing taggers.
Findings
TagGPT demonstrates effective zero-shot multimodal tagging on public datasets.
TagGPT outperforms existing hashtag and tagger methods.
The modular framework can be adapted to different LLMs and sentence-embedding models.
Abstract
Tags are pivotal in facilitating the effective distribution of multimedia content in various applications in the contemporary Internet era, such as search engines and recommendation systems. Recently, large language models (LLMs) have demonstrated impressive capabilities across a wide range of tasks. In this work, we propose TagGPT, a fully automated system capable of tag extraction and multimodal tagging in a completely zero-shot fashion. Our core insight is that, through elaborate prompt engineering, LLMs are able to extract and reason about proper tags given textual clues of multimodal data, e.g., OCR, ASR, title, etc. Specifically, to automatically build a high-quality tag set that reflects user intent and interests for a specific application, TagGPT predicts large-scale candidate tags from a series of raw data via prompting LLMs, filtered with frequency and semantics. Given a new…
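The tag-set construction step described in the abstract (candidate tags filtered "with frequency and semantics") can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `min_count` and `max_similarity` thresholds, and the character-trigram vectors (a toy stand-in for real sentence embeddings) are all assumptions made for illustration, and the LLM prompting that would produce the candidate tags is not shown.

```python
from collections import Counter
from math import sqrt


def trigram_vector(text):
    """Toy stand-in for a sentence embedding: character-trigram counts (assumption)."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_tag_set(candidates, min_count=2, max_similarity=0.6):
    """Filter candidate tags by frequency, then drop near-duplicates by similarity.

    Both thresholds are hypothetical values chosen for this example.
    """
    # Frequency filter: keep tags proposed at least `min_count` times.
    counts = Counter(tag.strip().lower() for tag in candidates if tag.strip())
    frequent = [tag for tag, n in counts.most_common() if n >= min_count]

    # Semantic filter: visit tags from most to least frequent and keep a tag
    # only if it is not too similar to any tag already kept.
    kept = []
    for tag in frequent:
        vec = trigram_vector(tag)
        if all(cosine(vec, trigram_vector(other)) < max_similarity for other in kept):
            kept.append(tag)
    return kept
```

For example, given three occurrences of "street food", two of "street foods", two of "travel vlog", and one of "asmr", `build_tag_set` returns `['street food', 'travel vlog']`: "asmr" is dropped as too rare and "street foods" as a near-duplicate of a more frequent tag. In a real system, a sentence-embedding model would replace `trigram_vector`.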
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Web Data Mining and Analysis
