# PAP900: A dataset of semantic relationships between affective words in Portuguese

**Authors:** André Fernandes dos Santos, José Paulo Leal, Rui Alexandre Alves, Teresa Jacques

PMC · DOI: 10.1016/j.dib.2025.111726 · 2025-05-30

## TL;DR

PAP900 is a Portuguese dataset of 900 affective word pairs, each annotated for semantic similarity and semantic relatedness by at least 30 human raters.

## Contribution

PAP900 is the first Portuguese semantic relations dataset to focus on affective words. It provides individual ratings, the word categorization used to build the pairs, and annotator sociodemographics.

## Key findings

- The dataset includes semantic similarity and relatedness ratings for 900 affective word pairs.
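- Annotator sociodemographics are included, enabling analysis of how personal factors influence the perception of semantic relationships.
- The dataset is distributed in distinct formats: final averaged values for researchers who need only those, and individual ratings, word categorization, and annotator data for more detailed analyses.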

## Abstract

The PAP900 dataset centers on the semantic relationship between affective words in Portuguese. It contains 900 word pairs, each annotated by at least 30 human raters for both semantic similarity and semantic relatedness. In addition to the semantic ratings, the dataset includes the word categorization used to build the word pairs and detailed sociodemographic information about annotators, enabling the analysis of the influence of personal factors on the perception of semantic relationships. Furthermore, this article describes in detail the dataset construction process, from word selection to agreement metrics.

Data was collected from Portuguese university psychology students, who completed two rounds of questionnaires. In the first round, annotators were asked to rate word pairs on either semantic similarity or semantic relatedness. The second round switched the relation type for most annotators, while a small percentage were asked to rate the same relation again. The instructions emphasized the differences between semantic relatedness and semantic similarity, and provided examples of expected ratings for both.

There are few semantic relations datasets in Portuguese, and none focusing on affective words. PAP900 is distributed in distinct formats so that it is easy to use both for researchers who only need the final averaged values and for researchers who want to take advantage of the individual ratings, the word categorization, and the annotator data. This dataset is a valuable resource for researchers in computational linguistics, natural language processing, psychology, and cognitive science.
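
The relationship between individual ratings and final averaged values can be illustrated with a minimal sketch. It assumes a long-format list of records with hypothetical fields (annotator ID, two words, relation type, rating); these field names, the example words, and the rating values are illustrative only and are not taken from the PAP900 files, whose actual layout should be checked in the dataset documentation.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical long-format records: (annotator_id, word1, word2, relation, rating).
# Field order, words and rating values are assumptions made for illustration;
# they do not reproduce the PAP900 file layout or its data.
ratings = [
    ("a01", "alegria", "felicidade", "similarity", 4),
    ("a02", "alegria", "felicidade", "similarity", 3),
    ("a01", "alegria", "festa", "similarity", 1),
    ("a02", "alegria", "festa", "similarity", 2),
    ("a03", "alegria", "festa", "relatedness", 4),
    ("a04", "alegria", "festa", "relatedness", 3),
]


def aggregate(records, min_raters=1):
    """Average individual ratings per (word1, word2, relation).

    Groups with fewer than `min_raters` ratings are dropped.
    """
    grouped = defaultdict(list)
    for _annotator, w1, w2, relation, rating in records:
        grouped[(w1, w2, relation)].append(rating)
    return {
        key: {"mean": mean(values), "n_raters": len(values)}
        for key, values in grouped.items()
        if len(values) >= min_raters
    }


averaged = aggregate(ratings)
for (w1, w2, relation), stats in sorted(averaged.items()):
    print(f"{w1}-{w2} [{relation}]: mean={stats['mean']:.2f} (n={stats['n_raters']})")
```

Keeping the relation type in the grouping key keeps similarity and relatedness scores for the same pair separate. The `min_raters` parameter is a hypothetical convenience for filtering; with the real data it could be set to 30 to check the stated minimum number of raters per pair.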

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12173755/full.md

---
Source: https://tomesphere.com/paper/PMC12173755