LLM Teacher-Student Framework for Text Classification With No Manually Annotated Data: A Case Study in IPTC News Topic Classification
Taja Kuzman, Nikola Ljubešić

TL;DR
This paper introduces a teacher-student framework that uses large language models to build multilingual news topic classifiers without manual annotation, achieving high accuracy and cross-lingual transfer from a limited amount of automatically annotated training data.
Contribution
The study presents a novel LLM-based teacher-student approach for zero-annotation multilingual news classification, demonstrating effective automatic dataset creation and strong cross-lingual performance.
Findings
Teacher model achieves high zero-shot accuracy across four languages.
Student models perform comparably to the teacher when trained on a limited amount of teacher-annotated data.
Strong zero-shot cross-lingual capabilities demonstrated.
Abstract
With the ever-increasing number of news stories available online, classifying them by topic, regardless of the language they are written in, has become crucial for enhancing readers' access to relevant content. To address this challenge, we propose a teacher-student framework based on large language models (LLMs) for developing multilingual news topic classification models of reasonable size with no need for manual data annotation. The framework employs a Generative Pretrained Transformer (GPT) model as the teacher model to develop a news topic training dataset through automatic annotation of 20,000 news articles in Slovenian, Croatian, Greek, and Catalan. Articles are classified into 17 main categories from the Media Topic schema, developed by the International Press Telecommunications Council (IPTC). The teacher model exhibits high zero-shot performance in all four languages. Its…
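The automatic annotation step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the function names (`build_prompt`, `annotate`, `stub_teacher`), and the four-label subset are all assumptions made for the example, and the teacher is a keyword stub standing in for a real GPT call.

```python
from typing import Callable, Dict, List

# Illustrative subset only (assumption); the paper uses 17 top-level
# IPTC Media Topic categories.
LABELS = ["sport", "politics", "health", "education"]


def build_prompt(text: str, labels: List[str]) -> str:
    """Compose a zero-shot classification prompt (hypothetical wording)."""
    options = ", ".join(labels)
    return (
        "Classify the news article into exactly one of these topics: "
        f"{options}.\nAnswer with the topic name only.\n\nArticle: {text}"
    )


def annotate(
    texts: List[str],
    teacher: Callable[[str], str],
    labels: List[str],
) -> List[Dict[str, str]]:
    """Label each text with the teacher; drop answers outside the label set."""
    dataset = []
    for text in texts:
        answer = teacher(build_prompt(text, labels)).strip().lower()
        if answer in labels:
            dataset.append({"text": text, "label": answer})
    return dataset


def stub_teacher(prompt: str) -> str:
    """Stand-in for a real LLM call, keyed on a keyword in the article."""
    article = prompt.split("Article: ", 1)[1].lower()
    if "match" in article:
        return "Sport"
    if "election" in article:
        return "politics"
    return "unknown"


articles = [
    "The match ended 2-1.",
    "The election is next week.",
    "Rain is expected.",
]
training_data = annotate(articles, stub_teacher, LABELS)
```

In this sketch, the third article is discarded because the stub's answer falls outside the label set. The resulting `training_data` list is the kind of automatically labeled set a smaller student classifier would then be fine-tuned on; that training step is omitted here.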
Taxonomy
Topics: Imbalanced Data Classification Techniques · Text and Document Classification Technologies
Methods: Attention Is All You Need · Residual Connection · Softmax · Adam · Label Smoothing · Dropout · Dense Connections · Linear Layer · Layer Normalization · Byte Pair Encoding
