GPTs Are Multilingual Annotators for Sequence Generation Tasks

Juhwan Choi; Eunju Lee; Kyohoon Jin; YoungBin Kim

arXiv:2402.05512 · cs.CL · February 9, 2024 · 1 citation

TL;DR

This paper introduces a cost-effective, autonomous annotation method that uses large language models to generate datasets, with particular benefit for low-resource languages. The authors demonstrate it by constructing an image captioning dataset.

Contribution

It presents a novel approach leveraging large language models for autonomous data annotation, reducing costs and enabling low-resource language dataset construction.

Findings

01 Method is cost-efficient and scalable.

02 Effective for low-resource language annotation.

03 Constructed an image captioning dataset using the approach.

Abstract

Data annotation is an essential step in constructing new datasets. However, the conventional approach of annotating data through crowdsourcing is both time-consuming and expensive. In addition, the process becomes more complex for low-resource languages owing to differences in the language pool of crowdworkers. To address these issues, this study proposes an autonomous annotation method that uses large language models, which have recently demonstrated remarkable performance. Through our experiments, we demonstrate that the proposed method is not only cost-efficient but also applicable to low-resource language annotation. Additionally, we constructed an image captioning dataset using our approach and are committed to releasing this dataset for future study. We have released our source code for further study and reproducibility.
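The abstract does not spell out the prompting or filtering details, so the following is only a minimal sketch of what LLM-based multilingual caption annotation could look like, not the authors' implementation. The function names (`build_prompt`, `annotate`, `fake_llm`), the JSON reply format, and the language codes are all assumptions made for illustration. The model call is passed in as a plain callable, so any chat API could be swapped in.

```python
import json
from typing import Callable, Dict, List


def build_prompt(source_caption: str, target_langs: List[str], n_variants: int) -> str:
    """Compose an instruction asking for paraphrased captions per language (hypothetical format)."""
    langs = ", ".join(target_langs)
    return (
        f"Paraphrase the caption below into {n_variants} distinct captions "
        f"for each of these languages: {langs}.\n"
        "Reply with a JSON object mapping each language code to a list of strings.\n"
        f"Caption: {source_caption}"
    )


def annotate(
    source_caption: str,
    target_langs: List[str],
    n_variants: int,
    llm: Callable[[str], str],
) -> Dict[str, List[str]]:
    """Query the model once and keep only languages whose output passes basic checks."""
    raw = llm(build_prompt(source_caption, target_langs, n_variants))
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return {}  # unparseable reply: discard and let the caller retry
    if not isinstance(parsed, dict):
        return {}

    result: Dict[str, List[str]] = {}
    for lang in target_langs:
        captions = parsed.get(lang)
        # Accept a language only if it has exactly n_variants non-empty strings.
        if (
            isinstance(captions, list)
            and len(captions) == n_variants
            and all(isinstance(c, str) and c.strip() for c in captions)
        ):
            result[lang] = captions
    return result


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns a canned reply with an incomplete 'ko' entry."""
    return json.dumps(
        {
            "en": ["A dog runs on the beach.", "A dog is running along the shore."],
            "ko": ["개가 해변을 달린다."],
        }
    )


annotations = annotate("A dog running on a beach", ["en", "ko"], 2, fake_llm)
```

In this sketch the incomplete `ko` entry is dropped rather than repaired, on the assumption that a failed language would simply be re-queried; a real pipeline might retry or apply stricter quality filters.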

Code & Models

Repositories

c-juhwan/gpt-multilingual-annotator (PyTorch · Official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems