German Text Embedding Clustering Benchmark

Silvan Wehrli; Bert Arnrich; Christopher Irrgang

arXiv:2401.02709·cs.CL·January 8, 2024·2 cites

Open Access · 1 Repo · 4 Datasets

TL;DR

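This paper presents a benchmark for clustering German text embeddings across different domains. It evaluates pre-trained mono- and multilingual models with several clustering algorithms, and examines how dimensionality reduction and continued pre-training affect clustering performance, the latter especially for short texts. All code and datasets are publicly available.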

Contribution

It introduces a new benchmark for German text embedding clustering, including analysis of models, dimensionality reduction, and continued pre-training effects.

Findings

1. Both mono- and multilingual models perform strongly
2. Reducing the dimensions of embeddings can further improve clustering
3. Continued pre-training of German BERT models can significantly improve short-text clustering

Abstract

This work introduces a benchmark assessing the performance of clustering German text embeddings in different domains. This benchmark is driven by the increasing use of clustering neural text embeddings in tasks that require the grouping of texts (such as topic modeling) and the need for German resources in existing benchmarks. We provide an initial analysis for a range of pre-trained mono- and multilingual models evaluated on the outcome of different clustering algorithms. Results include strong performing mono- and multilingual models. Reducing the dimensions of embeddings can further improve clustering. Additionally, we conduct experiments with continued pre-training for German BERT models to estimate the benefits of this additional training. Our experiments suggest that significant performance improvements are possible for short text. All code and datasets are publicly available.
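
The pipeline the abstract describes (embed texts, optionally reduce dimensions, cluster, score against known labels) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the benchmark's implementation: the Gaussian blobs stand in for real text embeddings, and PCA, k-means, and cluster purity are choices made here for illustration, since the text above does not name the specific reduction method, clustering algorithms, or metric. All function names are hypothetical.

```python
import numpy as np


def make_toy_embeddings(n_per_class=50, n_classes=3, dim=64, seed=0):
    """Stand-in for text embeddings (assumption): one Gaussian blob per label."""
    rng = np.random.default_rng(seed)
    centers = rng.normal(scale=5.0, size=(n_classes, dim))
    X = np.vstack([c + rng.normal(size=(n_per_class, dim)) for c in centers])
    y = np.repeat(np.arange(n_classes), n_per_class)
    return X, y


def pca_reduce(X, n_components):
    """Project centered data onto its top principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster index per row of X."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n, k).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster went empty.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels


def purity(labels_true, labels_pred):
    """Fraction of points that share the majority label of their cluster."""
    total = 0
    for c in np.unique(labels_pred):
        members = labels_true[labels_pred == c]
        total += np.bincount(members).max()
    return total / len(labels_true)


X, y = make_toy_embeddings()
score_full = purity(y, kmeans(X, k=3))
score_reduced = purity(y, kmeans(pca_reduce(X, n_components=8), k=3))
print(f"purity, full 64 dims:  {score_full:.2f}")
print(f"purity, PCA to 8 dims: {score_reduced:.2f}")
```

On synthetic blobs like these, the two scores say nothing about the paper's findings; the sketch only shows where a reduction step would sit relative to clustering and evaluation. With real data, the toy generator would be replaced by embeddings from a pre-trained model and the labels by a dataset's category annotations.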

Code & Models

Repositories

climsocana/tecb-de
Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Attention Dropout · Dense Connections · Weight Decay · WordPiece · Softmax · Adam · Dropout