LLMs as annotators of credibility assessment in Danish asylum decisions: evaluating classification performance and errors beyond aggregated metrics

Galadrielle Humblot-Renaux; Mohammad N. S. Jahromi; Rohat Bakuri-Jørgensen; Marieke Anne Heyl; Asta S. Stage Jarlner; Maria Vlachou; Anna Murphy Høgenhaug; Desmond Elliott; Thomas Gammeltoft-Hansen; Thomas B. Moeslund

arXiv:2605.13412·cs.CL·May 14, 2026

TL;DR

This study evaluates how well large language models annotate Danish asylum decision texts for the presence and sentiment of credibility assessments, introducing a new dataset (RAB-Cred) and analyzing errors beyond aggregate metrics.

Contribution

It introduces RAB-Cred, a Danish legal NLP dataset with expert annotations, and systematically benchmarks 21 open-weight models and 30 system-user prompt combinations for zero-shot and few-shot classification of credibility assessments in asylum decisions.

Findings

01. LLMs show potential for cost-effective annotation but are inconsistent.
02. Error analysis reveals model and prompt choice significantly impact performance.
03. The dataset and code are publicly available for further research.

Abstract

Off-the-shelf large language models (LLMs) are increasingly used to automate text annotation, yet their effectiveness remains underexplored for underrepresented languages and specialized domains where the class definition requires subtle expert understanding. We investigate LLM-based annotation for a novel legal NLP task: identifying the presence and sentiment of credibility assessments in asylum decision texts. We introduce RAB-Cred, a Danish text classification dataset featuring high-quality, expert annotations and valuable metadata such as annotator confidence and asylum case outcome. We benchmark 21 open-weight models and 30 system-user prompt combinations for this task, and systematically evaluate the effect of model and prompt choice for zero-shot and few-shot classification. We zoom in on the errors made by top-performing models and prompts, investigating error consistency across…
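
As a rough illustration of what evaluating "beyond aggregated metrics" can involve, the sketch below computes per-class precision, recall, and F1 alongside the macro average, and measures how much two models' errors overlap. It is not the authors' evaluation code. The label names (`none`, `positive`, `negative`), the toy predictions, and the choice of Jaccard overlap as an error-consistency measure are all assumptions made for illustration; the source only states that the task covers the presence and sentiment of credibility assessments.

```python
# Hypothetical label set (assumption): the source does not list class names.
LABELS = ["none", "positive", "negative"]


def per_class_scores(gold, pred, labels=LABELS):
    """Return {label: (precision, recall, f1)}, computed one class at a time."""
    scores = {}
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-class F1 values."""
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


def error_overlap(gold, pred_a, pred_b):
    """Jaccard overlap between the sets of examples each model misclassifies.

    Returns 0.0 when neither model makes any error (nothing to compare).
    """
    errors_a = {i for i, (g, p) in enumerate(zip(gold, pred_a)) if g != p}
    errors_b = {i for i, (g, p) in enumerate(zip(gold, pred_b)) if g != p}
    union = errors_a | errors_b
    return len(errors_a & errors_b) / len(union) if union else 0.0


# Toy data (invented for illustration, not taken from RAB-Cred).
gold = ["none", "positive", "negative", "negative", "none", "positive"]
pred_a = ["none", "positive", "positive", "negative", "none", "none"]
pred_b = ["none", "negative", "positive", "negative", "none", "positive"]

scores_a = per_class_scores(gold, pred_a)
for label, (p, r, f1) in scores_a.items():
    print(f"{label:>8}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"macro F1 (model A): {macro_f1(scores_a):.2f}")
print(f"error overlap A vs B: {error_overlap(gold, pred_a, pred_b):.2f}")
```

In this toy example both models reach the same accuracy, yet they agree on only one of the three examples that at least one of them gets wrong. A single aggregate score would hide that difference, which is the kind of pattern a per-class and per-example breakdown is meant to surface.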

Code & Models

Repositories

glhr/RAB-Cred
github

Datasets

XAI-CRED/RAB-Cred
dataset· 22 dl