Prosa: Rubric-Based Evaluation of LLMs on Real User Chats in Brazilian Portuguese

Roseval Malaquias Junior; Giovana Kerche Bonás; Thales Sales Almeida; Hugo Abonizio; Thiago Laitz; Ramon Pires; Marcos Piau; Celio Larcher; Rodrigo Nogueira

arXiv:2605.01630·cs.CL·May 5, 2026

TL;DR

Prosa introduces a rubric-based multi-judge evaluation method for Brazilian Portuguese chat models, demonstrating improved consistency and discriminative power over holistic scoring, and provides an open benchmark and tools.

Contribution

The paper presents Prosa, a multi-judge benchmark for Brazilian Portuguese chat models built from real user conversations, and shows that binary rubric scoring with multi-judge filtering removes the rankings' sensitivity to the bias of the chosen judge model.

Findings

01

Under filtered rubric scoring, the three judges agree on all 16 ranks; under holistic scoring, they agree on only 7 of 16.

02

The rubric filtering pipeline increases the average score gap between neighbouring models by 47%, improving the benchmark's discriminative power.

03

Evaluating a new model costs approximately $2.1 when Gemini 3 Flash is used as the judge.
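
The first two findings rest on two quantities: how many rank positions the judges agree on, and the average score gap between neighbouring models. The sketch below shows one plausible way to compute both. It assumes that "agreeing on a rank" means every judge places the same model at that position. The function names and the toy scores are hypothetical and are not taken from the paper or its released tools.

```python
from typing import Dict, List


def ranking(scores: Dict[str, float]) -> List[str]:
    """Return model names ordered from highest to lowest score (ties broken by name)."""
    return sorted(scores, key=lambda model: (-scores[model], model))


def rank_positions_agreed(per_judge: List[Dict[str, float]]) -> int:
    """Count rank positions at which every judge places the same model.

    Assumption: this position-wise reading is one interpretation of
    "judges agree on a rank"; the paper may define agreement differently.
    """
    rankings = [ranking(scores) for scores in per_judge]
    return sum(1 for column in zip(*rankings) if len(set(column)) == 1)


def mean_neighbour_gap(scores: Dict[str, float]) -> float:
    """Average score difference between models adjacent in the ranking."""
    ordered = sorted(scores.values(), reverse=True)
    gaps = [high - low for high, low in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)


# Toy data (hypothetical): three judges scoring three models.
judge_a = {"m1": 0.90, "m2": 0.70, "m3": 0.40}
judge_b = {"m1": 0.80, "m2": 0.60, "m3": 0.50}
judge_c = {"m1": 0.85, "m2": 0.50, "m3": 0.55}

agreed = rank_positions_agreed([judge_a, judge_b, judge_c])
gap = mean_neighbour_gap(judge_a)
```

In this toy case the judges agree only on the top position, because the third judge swaps the last two models. A larger mean neighbour gap means adjacent models are easier to tell apart.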

Abstract

Rankings produced by holistic LLM-as-a-judge scoring are sensitive to the bias of the chosen judge model. We show that switching to binary rubric scoring with multi-judge filtering removes this sensitivity: decomposing the judgement matters more than the judge model itself. To support this claim, we introduce Prosa, the first real user multi-turn Brazilian Portuguese chat benchmark: 1,000 WildChat conversations scored by three judges from three model families on 16 models. Under filtered rubric scoring the three judges agree on every one of the 16 ranks, whereas under holistic scoring they agree on only 7 of 16. Additionally, the rubric filtering pipeline increases the average score gap between neighbouring models by 47%, thereby improving Prosa's discriminative power. Evaluating a new model on Prosa costs approximately $2.1 when using Gemini 3 Flash as the judge. We release the…
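
To make binary rubric scoring with multi-judge filtering concrete, here is a minimal sketch. The abstract does not say how the filter works, so the rule used here (keep only the criteria on which all judges return the same verdict) is a hypothetical stand-in, not the authors' pipeline. The criterion names and verdicts are invented for illustration.

```python
from typing import Dict, List

# verdicts[judge][criterion] is True when the judge marks the criterion as met.
Verdicts = Dict[str, Dict[str, bool]]


def unanimous_criteria(verdicts: Verdicts) -> List[str]:
    """Keep the criteria on which every judge gives the same binary verdict.

    Assumption: unanimity is only an illustrative filter; the actual
    Prosa filtering rule is not described in this excerpt.
    """
    judges = list(verdicts.values())
    return [c for c in judges[0] if len({judge[c] for judge in judges}) == 1]


def rubric_score(judge_verdicts: Dict[str, bool], kept: List[str]) -> float:
    """Fraction of kept criteria that one judge marks as met."""
    if not kept:
        return 0.0
    return sum(judge_verdicts[c] for c in kept) / len(kept)


# Hypothetical verdicts from three judges for one model response.
verdicts: Verdicts = {
    "judge_a": {"answers_in_portuguese": True, "cites_source": False, "addresses_follow_up": True},
    "judge_b": {"answers_in_portuguese": True, "cites_source": True, "addresses_follow_up": True},
    "judge_c": {"answers_in_portuguese": True, "cites_source": False, "addresses_follow_up": True},
}

kept = unanimous_criteria(verdicts)
score_a = rubric_score(verdicts["judge_a"], kept)
```

Here the judges disagree on `cites_source`, so it is dropped and the response is scored on the two remaining criteria. A holistic score would instead ask each judge for a single overall rating, leaving no per-criterion verdicts to compare across judges.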
