University of Indonesia at SemEval-2025 Task 11: Evaluating State-of-the-Art Encoders for Multi-Label Emotion Detection

Ikhlasul Akmal Hanif; Eryawan Presma Yulianrifat; Jaycent Gunawan Ongris; Eduardus Tjitrahardja; Muhammad Falensi Azmi; Rahmat Bryan Naufal; Alfan Farizki Wicaksono

arXiv:2505.16460·cs.CL·May 23, 2025

Open Access

TL;DR

This paper evaluates encoder models for multi-label emotion detection across 28 languages. Training only a classifier on top of prompt-based encoders such as mE5 and BGE outperforms fully fine-tuning XLMR and mBERT, and an ensemble of BGE models with CatBoost classifiers gives the best leaderboard result.

Contribution

The paper compares full fine-tuning against classifier-only training for multilingual emotion detection, varying the fine-tuning strategy, model architecture, loss function, encoder, and classifier.

Findings

01

Training a classifier on top of prompt-based encoders such as mE5 and BGE outperforms fully fine-tuning XLMR and mBERT.

02

An ensemble of BGE models with CatBoost classifiers achieves an average F1-macro score of 56.58 across all languages.

03

Classifier-only training is more effective than full fine-tuning in this task.
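To make the classifier-only idea concrete, here is a minimal sketch in which the encoder is treated as frozen and only one small head per emotion label is trained on its output vectors. Everything in it is an assumption for illustration: the two-dimensional `embeddings` are hand-written stand-ins for encoder outputs, the label columns are hypothetical, and a plain logistic-regression head replaces the CatBoost classifier the paper reports using, so the example runs without external libraries.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_label_head(features, targets, epochs=200, lr=0.5):
    """Fit one logistic-regression head on fixed (frozen) feature vectors."""
    dim = len(features[0])
    n = len(features)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, targets):
            err = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        # Only the head's parameters are updated; the features never change.
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(heads, x, threshold=0.5):
    """Return one 0/1 decision per label for a single feature vector."""
    return [
        int(sigmoid(sum(w * v for w, v in zip(ws, x)) + b) >= threshold)
        for ws, b in heads
    ]


# Hypothetical frozen embeddings (stand-ins for encoder outputs).
embeddings = [[1.0, -1.0], [1.0, 1.0], [-1.0, 1.0], [-1.0, -1.0]]
# Hypothetical multi-label targets, one column per emotion.
labels = [[1, 0], [1, 1], [0, 1], [0, 0]]

# Binary relevance: train an independent head for each label column.
heads = [
    train_label_head(embeddings, [row[k] for row in labels])
    for k in range(len(labels[0]))
]
predictions = [predict(heads, x) for x in embeddings]
new_prediction = predict(heads, [0.8, -0.6])
```

Full fine-tuning would instead update the encoder weights together with the head, which this sketch deliberately does not model.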

Abstract

This paper presents our approach for SemEval-2025 Task 11 Track A, focusing on multi-label emotion classification across 28 languages. We explore two main strategies: fully fine-tuning transformer models and classifier-only training, evaluating different settings such as fine-tuning strategies, model architectures, loss functions, encoders, and classifiers. Our findings suggest that training a classifier on top of prompt-based encoders such as mE5 and BGE yields significantly better results than fully fine-tuning XLMR and mBERT. Our best-performing model on the final leaderboard is an ensemble of multiple BGE models in different configurations, with CatBoost as the classifier. This ensemble achieves an average F1-macro score of 56.58 across all languages.
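For readers unfamiliar with the reported metric, the following sketch shows how an F1-macro score can be computed for multi-label predictions: F1 is computed separately for each emotion label and the per-label scores are then averaged with equal weight. The function name `f1_macro`, the toy labels, and the choice to score a label as 0.0 when it has no true or predicted positives are assumptions for illustration, not details taken from the paper or the task's official scorer.

```python
def f1_macro(y_true, y_pred):
    """Unweighted mean of per-label F1 over binary indicator rows."""
    n_labels = len(y_true[0])
    scores = []
    for k in range(n_labels):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t[k] == 1 and p[k] == 1)
        fp = sum(1 for t, p in pairs if t[k] == 0 and p[k] == 1)
        fn = sum(1 for t, p in pairs if t[k] == 1 and p[k] == 0)
        denom = 2 * tp + fp + fn
        # Assumed convention: a label with no positives at all scores 0.0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_labels


# Toy data: four texts, three hypothetical emotion labels.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]]

score = f1_macro(y_true, y_pred)
print(round(score, 4))  # 0.7222 (per-label F1: 1.0, 0.5, 0.6667)
```

The figure quoted in the abstract is this kind of score averaged again over languages, so a language with few examples counts as much as one with many.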

Taxonomy

Topics: Sentiment Analysis and Opinion Mining · Hate Speech and Cyberbullying Detection · Mental Health via Writing

Methods: mBERT