GPTs Are Multilingual Annotators for Sequence Generation Tasks
Juhwan Choi, Eunju Lee, Kyohoon Jin, YoungBin Kim

TL;DR
This paper introduces a cost-effective, autonomous annotation method that uses large language models to generate datasets. The method is especially useful for low-resource languages, and the authors demonstrate it by constructing an image captioning dataset.
Contribution
The paper presents an approach that uses large language models for autonomous data annotation, reducing cost and enabling dataset construction for low-resource languages.
Findings
The method is cost-efficient and scalable.
It is effective for low-resource language annotation.
The authors constructed an image captioning dataset using the approach.
Abstract
Data annotation is an essential step in constructing new datasets. However, the conventional approach of annotating data through crowdsourcing is both time-consuming and expensive. The process becomes more complex for low-resource languages, because the pool of crowdworkers who speak a given language varies in size. To address these issues, this study proposes an autonomous annotation method that uses large language models, which have recently been shown to exhibit remarkable performance. Through our experiments, we demonstrate that the proposed method is not only cost-efficient but also applicable to low-resource language annotation. Additionally, we constructed an image captioning dataset using our approach and are committed to releasing this dataset for future study. We have released our source code for further study and reproducibility.
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
