Quati: A Brazilian Portuguese Information Retrieval Dataset from Native Speakers

Mirelle Bueno; Eduardo Seiti de Oliveira; Rodrigo Nogueira; Roberto A. Lotufo; Jayr Alencar Pereira

arXiv:2404.06976·cs.IR·April 11, 2024·1 cites

TL;DR

Quati is a new high-quality Brazilian Portuguese information retrieval dataset created with native speaker queries and web-sourced documents, utilizing advanced LLM labeling and serving as a benchmark for IR systems.

Contribution

It introduces Quati, the first comprehensive IR dataset for Brazilian Portuguese, with a novel annotation methodology using state-of-the-art LLMs for cost-effective labeling.

Findings

01 LLM-based annotations reach inter-annotator agreement comparable to that of human annotators.

02 Various IR systems are benchmarked on Quati.

03 The dataset is publicly available for research use.

Abstract

Despite Portuguese being one of the most spoken languages in the world, there is a lack of high-quality information retrieval datasets in that language. We present Quati, a dataset specifically designed for the Brazilian Portuguese language. It comprises a collection of queries formulated by native speakers and a curated set of documents sourced from a selection of high-quality Brazilian Portuguese websites. These websites are more likely to be visited by real users than randomly scraped ones, which makes the corpus more representative and relevant. To label the query-document pairs, we use a state-of-the-art LLM, which shows inter-annotator agreement levels comparable to human performance in our assessments. We provide a detailed description of our annotation methodology to enable others to create similar datasets for other languages, providing a cost-effective way of creating…
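As an illustration of what "inter-annotator agreement" between an LLM and a human annotator can mean in practice, here is a minimal sketch that computes Cohen's kappa over two lists of relevance labels. The abstract does not name the agreement statistic, so the choice of Cohen's kappa, the function name `cohen_kappa`, the 0 to 3 relevance scale, and the sample labels are all assumptions made for illustration, not the authors' procedure or data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    Hypothetical helper: the source does not say which agreement
    statistic the authors used.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability both pick the same label at random,
    # given each annotator's own label distribution.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up relevance grades for eight query-document pairs
# (assumed scale: 0 = irrelevant ... 3 = highly relevant).
human = [0, 1, 2, 3, 1, 0, 2, 2]
llm = [0, 1, 2, 2, 1, 0, 3, 2]
print(round(cohen_kappa(human, llm), 3))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when a few relevance grades dominate the label distribution.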

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems

Methods: Sparse Evolutionary Training