LegalBench-BR: A Benchmark for Evaluating Large Language Models on Brazilian Legal Decision Classification
Pedro Barbosa de Carvalho Neto

TL;DR
LegalBench-BR is a benchmark of 3,105 Brazilian appellate proceedings for evaluating language models on legal text classification. On it, a LoRA fine-tuned BERTimbau outperforms general-purpose commercial LLMs by 22 to 28 percentage points.
Contribution
It introduces the first public benchmark for Brazilian legal text classification and shows that a domain-adapted model updating only 0.3% of its parameters outperforms general-purpose LLMs.
Findings
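Fine-tuned BERTimbau-LoRA achieves 87.6% accuracy and 0.87 macro-F1 on a class-balanced test set.
The fine-tuned model outperforms general-purpose LLMs by 22 to 28 percentage points (+22pp over Claude 3.5 Haiku, +28pp over GPT-4o mini).
Fine-tuning eliminates the systematic bias toward civel (civil law) that both commercial LLMs exhibit: on administrativo (administrative law), GPT-4o mini scores F1 = 0.00 and Claude 3.5 Haiku F1 = 0.08, while the fine-tuned model reaches F1 = 0.91.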
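The bias finding is easier to see with a toy calculation. The sketch below is not the paper's evaluation code: the twelve labels are invented, "other" is a placeholder standing in for the remaining classes, and the helper names are hypothetical. It shows how a classifier that absorbs every administrativo case into civel ends up with F1 = 0 on administrativo while keeping a moderate overall accuracy.

```python
from collections import Counter

# Toy label set: "other" is a placeholder, not one of the benchmark's classes.
LABELS = ["civel", "administrativo", "other"]

def per_class_f1(y_true, y_pred, labels):
    """Return {label: F1}; F1 is 0.0 when a class has no true positives."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

def macro_f1(scores):
    """Unweighted mean of the per-class F1 scores."""
    return sum(scores.values()) / len(scores)

# Class-balanced toy test set: 4 examples per class (invented data).
y_true = ["civel"] * 4 + ["administrativo"] * 4 + ["other"] * 4
# Biased predictions: every administrativo case is labeled civel.
y_biased = ["civel"] * 4 + ["civel"] * 4 + ["other"] * 4

scores = per_class_f1(y_true, y_biased, LABELS)
accuracy = sum(t == p for t, p in zip(y_true, y_biased)) / len(y_true)
predicted_share = Counter(y_biased)

print(scores)             # per-class F1
print(macro_f1(scores))   # macro-F1
print(accuracy)           # overall accuracy
print(predicted_share)    # how often each label is predicted
```

In this toy case accuracy stays at 8/12 even though one class is never predicted, which is why per-class F1 and macro-F1 on a class-balanced test set expose the failure more clearly than accuracy alone.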
Abstract
We introduce LegalBench-BR, the first public benchmark for evaluating language models on Brazilian legal text classification. The dataset comprises 3,105 appellate proceedings from the Santa Catarina State Court (TJSC), collected via the DataJud API (CNJ) and annotated across five legal areas through LLM-assisted labeling with heuristic validation. On a class-balanced test set, BERTimbau-LoRA, updating only 0.3% of model parameters, achieves 87.6% accuracy and 0.87 macro-F1 (+22pp over Claude 3.5 Haiku, +28pp over GPT-4o mini). The gap is most striking on administrativo (administrative law): GPT-4o mini scores F1 = 0.00 and Claude 3.5 Haiku scores F1 = 0.08 on this class, while the fine-tuned model reaches F1 = 0.91. Both commercial LLMs exhibit a systematic bias toward civel (civil law), absorbing ambiguous classes rather than discriminating them, a failure mode that domain-adapted…
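To give a sense of how a LoRA setup can end up training only a fraction of a percent of the parameters, here is a back-of-the-envelope count. Every number in it is an assumption chosen for illustration (a BERT-base style encoder with hidden size 768, 12 layers, roughly 110M parameters, rank-8 adapters on two projection matrices per layer); the abstract does not state the paper's actual configuration, and the trainable classification head is ignored.

```python
def lora_added_params(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

# All values below are illustrative assumptions, not taken from the paper.
HIDDEN = 768                # hidden size of a BERT-base style encoder
LAYERS = 12                 # number of transformer layers
ADAPTED_PER_LAYER = 2       # e.g. the query and value projections
RANK = 8                    # LoRA rank
BASE_PARAMS = 110_000_000   # approximate frozen parameter count

added = LAYERS * ADAPTED_PER_LAYER * lora_added_params(HIDDEN, HIDDEN, RANK)
fraction = added / (BASE_PARAMS + added)

print(f"{added:,} trainable LoRA parameters, {fraction:.2%} of the total")
```

Under these assumptions the adapters add 294,912 parameters, about 0.27% of the total, which is the same order of magnitude as the 0.3% reported in the abstract.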
