Quati: A Brazilian Portuguese Information Retrieval Dataset from Native Speakers
Mirelle Bueno, Eduardo Seiti de Oliveira, Rodrigo Nogueira, Roberto A. Lotufo, Jayr Alencar Pereira

TL;DR
Quati is a Brazilian Portuguese information retrieval dataset built from queries written by native speakers and documents drawn from selected Brazilian Portuguese websites. Its query-document pairs are labeled by a state-of-the-art LLM, and it serves as a benchmark for IR systems.
Contribution
It introduces Quati, the first comprehensive IR dataset for Brazilian Portuguese, with a novel annotation methodology using state-of-the-art LLMs for cost-effective labeling.
Findings
LLM-based annotations achieve human-level agreement.
Benchmarking of various IR systems on Quati.
Dataset is publicly available for research use.
Abstract
Despite Portuguese being one of the most spoken languages in the world, there is a lack of high-quality information retrieval datasets in that language. We present Quati, a dataset specifically designed for the Brazilian Portuguese language. It comprises a collection of queries formulated by native speakers and a curated set of documents sourced from a selection of high-quality Brazilian Portuguese websites. These websites are more likely to be frequented by real users than randomly scraped ones, ensuring a more representative and relevant corpus. To label the query-document pairs, we use a state-of-the-art LLM, which shows inter-annotator agreement levels comparable to human performance in our assessments. We provide a detailed description of our annotation methodology to enable others to create similar datasets for other languages, providing a cost-effective way of creating…
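The abstract reports that the LLM's labels agree with human labels at a level comparable to human annotators, but it does not say here how agreement is measured. As a minimal sketch, assuming Cohen's kappa as the agreement statistic and a 0 to 3 graded relevance scale (both are assumptions, as are the function name and the toy labels), agreement between two annotators could be computed like this:

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    Assumption: this is an illustrative agreement measure, not
    necessarily the one used by the Quati authors.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(labels_a)

    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's marginal label rates.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical relevance scores for eight query-document pairs.
human_labels = [0, 1, 2, 3, 1, 0, 2, 3]
llm_labels = [0, 1, 2, 2, 1, 0, 3, 3]

kappa = cohen_kappa(human_labels, llm_labels)
print(f"Cohen's kappa: {kappa:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when some relevance grades are much more common than others. For graded scales a weighted variant is also common; the unweighted form is used here only to keep the sketch short.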
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
Methods: Sparse Evolutionary Training
