CAPITU: A Benchmark for Evaluating Instruction-Following in Brazilian Portuguese with Literary Context

Giovana Kerche Bonás; Roseval Malaquias Junior; Marcos Piau; Thiago Laitz; Thales Sales Almeida; Hugo Abonizio; Celio Larcher; Ramon Pires; Rodrigo Nogueira

arXiv:2603.22576 · cs.CL · March 25, 2026

TL;DR

CAPITU is a benchmark designed to evaluate the instruction-following abilities of Large Language Models in Brazilian Portuguese, using culturally-grounded literary tasks and automatic verification methods.

Contribution

The paper introduces a benchmark of culturally contextualized, automatically verifiable tasks in Portuguese, covering diverse linguistic and structural constraints, and evaluates state-of-the-art models on it in single-turn and multi-turn settings.

Findings

1. Reasoning models such as GPT-5.2 reach high accuracy (98.5%)

2. Portuguese-specialized models offer cost-efficient performance

3. Multi-turn evaluation shows significant variation in constraint persistence

Abstract

We introduce CAPITU, a benchmark for evaluating instruction-following capabilities of Large Language Models (LLMs) in Brazilian Portuguese. Unlike existing benchmarks that focus on English or use generic prompts, CAPITU contextualizes all tasks within eight canonical works of Brazilian literature, combining verifiable instruction constraints with culturally-grounded content. The benchmark comprises 59 instruction types organized into seven categories, all designed to be automatically verifiable without requiring LLM judges or human evaluation. Instruction types include Portuguese-specific linguistic constraints (word termination patterns like -ando/-endo/-indo, -inho/-inha, -mente) and structural requirements. We evaluate 18 state-of-the-art models across single-turn and multi-turn settings. Our results show that frontier reasoning models achieve strong performance (GPT-5.2 with…
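
To make "automatically verifiable" concrete, here is a minimal Python sketch of how a word-ending constraint could be checked without an LLM judge. It is not the benchmark's actual verifier: the names `SUFFIX_GROUPS`, `words_with_suffix`, and `check_min_count`, the "at least N words" rule, and the sample sentence are all assumptions made for illustration. Only the endings themselves (-ando/-endo/-indo, -inho/-inha, -mente) come from the abstract.

```python
import re
import unicodedata

# Hypothetical grouping of the endings named in the abstract.
SUFFIX_GROUPS = {
    "gerund": ("ando", "endo", "indo"),
    "diminutive": ("inho", "inha"),
    "adverb": ("mente",),
}

# Runs of Unicode letters, so accented Portuguese words stay whole.
WORD_RE = re.compile(r"[^\W\d_]+")


def words_with_suffix(text, suffixes):
    """Return the lowercased words in `text` that end with any of `suffixes`."""
    # NFC normalization keeps accented letters as single code points.
    normalized = unicodedata.normalize("NFC", text).lower()
    return [w for w in WORD_RE.findall(normalized) if w.endswith(suffixes)]


def check_min_count(text, group, minimum):
    """Assumed rule: the response must contain at least `minimum` matching words."""
    return len(words_with_suffix(text, SUFFIX_GROUPS[group])) >= minimum


# Illustrative response text (invented for this sketch).
response = "Capitu estava olhando o mar, lentamente, pensando no vizinho."
print(words_with_suffix(response, SUFFIX_GROUPS["gerund"]))
print(check_min_count(response, "gerund", 2))
```

A check like this is purely orthographic: it would also count words such as "quando" or "vizinho", which carry the ending without being a gerund or a diminutive. Whether CAPITU's verifiers filter such cases is not stated in the text available here.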

Taxonomy

Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling