Assessing In-context Learning and Fine-tuning for Topic Classification of German Web Data

Julian Schelb; Roberto Ulloa; Andreas Spitz

arXiv:2407.16516 · cs.CL · July 24, 2024

TL;DR

This study compares fine-tuning and in-context learning for topic classification of German web pages. It finds that a few hundred annotated examples per topic are enough to train an effective classifier and that combining URL and content features improves accuracy.

Contribution

The paper provides a comparative analysis of fine-tuned pre-trained encoder models and in-context learning for binary topic classification of German web data, showing that small annotated datasets suffice and that combining URL and content features works best.

Findings

01. Fine-tuning outperforms in-context learning.

02. Small annotated datasets are sufficient for effective classification.

03. Combining URL and content features yields the best results.

Abstract

Researchers in the political and social sciences often rely on classification models to analyze trends in information consumption by examining browsing histories of millions of webpages. Automated scalable methods are necessary due to the impracticality of manual labeling. In this paper, we model the detection of topic-related content as a binary classification task and compare the accuracy of fine-tuned pre-trained encoder models against in-context learning strategies. Using only a few hundred annotated data points per topic, we detect content related to three German policies in a database of scraped webpages. We compare multilingual and monolingual models, as well as zero and few-shot approaches, and investigate the impact of negative sampling strategies and the combination of URL & content-based features. Our results show that a small sample of annotated data is sufficient to train…
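
The abstract mentions two ingredients that are easy to picture in code: combining URL and content features, and zero- versus few-shot in-context learning. The sketch below is one possible way to set these up, not the authors' implementation. The function names, the tokenization rule, the character limit, the prompt wording, the topic string, and the example URL are all hypothetical choices made for illustration.

```python
import re
from urllib.parse import urlparse


def url_to_tokens(url: str) -> str:
    """Split a URL's host and path into lowercase word-like tokens (assumed scheme)."""
    parsed = urlparse(url)
    raw = f"{parsed.netloc} {parsed.path}"
    # Split on anything that is not a digit or a (German) letter; drop "www".
    parts = re.split(r"[^0-9A-Za-zÄÖÜäöüß]+", raw)
    tokens = [p.lower() for p in parts if p and p.lower() != "www"]
    return " ".join(tokens)


def combine_features(url: str, content: str, max_chars: int = 1000) -> str:
    """Concatenate URL tokens and truncated page text into one input string."""
    return f"URL: {url_to_tokens(url)}\nText: {content[:max_chars].strip()}"


def build_prompt(topic: str, query: str, examples=()) -> str:
    """Build a binary yes/no prompt; an empty `examples` gives the zero-shot case."""
    lines = [
        f"Decide whether the webpage is related to the topic '{topic}'. "
        "Reply with 'yes' or 'no'.",
        "",
    ]
    for text, is_related in examples:
        lines += [text, f"Answer: {'yes' if is_related else 'no'}", ""]
    lines += [query, "Answer:"]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical page; the URL and text are placeholders, not data from the paper.
    page = combine_features(
        "https://www.example.org/politik/energie-2023",
        "Ein Text über Energie.",
    )
    shots = [
        (combine_features("https://example.org/sport/fussball", "Spielbericht."), False),
        (combine_features("https://example.org/politik/energie", "Neues Gesetz."), True),
    ]
    print(build_prompt("energy policy", page))          # zero-shot
    print(build_prompt("energy policy", page, shots))   # few-shot
```

The same combined string could also serve as the input text for a fine-tuned encoder classifier, which would keep the feature representation identical across both approaches in a comparison like this one.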
