# CORHOH: Text corpus of Holocaust oral histories

**Authors:** Daban Q. Jaff

PMC · DOI: 10.1016/j.dib.2025.111426 · 2025-02-24

## TL;DR

The paper introduces CORHOH, a corpus of 500 Holocaust survivor oral histories with annotations for use in trauma and NLP research.

## Contribution

The novel contribution is the creation of a structured, annotated corpus of Holocaust oral histories with rich metadata and TEI compliance.

## Key findings

- The corpus includes 500 oral histories with 106,519 questions and 107,125 answers, each uniquely identified.
- It supports trauma expression analysis and NLP techniques like sentiment analysis and named entity recognition.
- The data, provided courtesy of the USHMM, is publicly available under a CC BY-NC-SA 4.0 license.

## Abstract

This paper outlines the compilation and annotation process of CORHOH: Text CORpus of Holocaust Oral Histories. The corpus consists of 500 oral histories, each a narrative from one survivor. The transcripts of the oral histories are retrieved from the Let Them Speak Project [1]. The transcripts are normalized and further annotated. The corpus offers rich metadata about both the testimony givers and the interviews. All technical content is removed, and a unique identifier is assigned to each question (posed by the interviewer) and answer (provided by the survivor). The corpus complies with the TEI guidelines [2]. The corpus includes 106,519 questions and 107,125 answers, and the identifiers make it easy to distinguish the utterances of the Holocaust survivor from those of anyone else involved in the interview, primarily the interviewer. CORHOH is particularly suited for studies on trauma expression and psychological concepts embedded in survivors' narratives. Additionally, it offers potential for data mining to uncover patterns (e.g., migration trends) and supports natural language processing techniques, such as topic modelling, sentiment analysis, and named entity recognition. The CORHOH data is courtesy of the United States Holocaust Memorial Museum (USHMM) and is publicly available under the CC BY-NC-SA 4.0 license.
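
To illustrate how uniquely identified questions and answers in a TEI-style file might be separated, here is a minimal sketch. The abstract does not describe the actual encoding, so everything specific here is an assumption: the use of the TEI `<u>` (utterance) element, the `who` values `#interviewer` and `#survivor`, the `xml:id` values, the placeholder text, and the helper name `split_utterances` are all hypothetical and may not match the real CORHOH files.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Hypothetical TEI fragment; the real CORHOH markup may differ.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <u who="#interviewer" xml:id="q1">Placeholder question 1?</u>
      <u who="#survivor" xml:id="a1">Placeholder answer 1.</u>
      <u who="#interviewer" xml:id="q2">Placeholder question 2?</u>
      <u who="#survivor" xml:id="a2">Placeholder answer 2.</u>
    </body>
  </text>
</TEI>"""


def split_utterances(tei_xml):
    """Return (questions, answers) dicts keyed by the assumed xml:id."""
    root = ET.fromstring(tei_xml)
    questions, answers = {}, {}
    for u in root.iter(f"{{{TEI_NS}}}u"):
        uid = u.get(f"{{{XML_NS}}}id")
        text = "".join(u.itertext()).strip()
        # Assumption: interviewer turns are marked who="#interviewer".
        if u.get("who") == "#interviewer":
            questions[uid] = text
        else:
            answers[uid] = text
    return questions, answers


questions, answers = split_utterances(SAMPLE)
print(len(questions), "questions;", len(answers), "answers")
```

Keeping the two speaker roles in separate dictionaries means downstream steps such as sentiment analysis or named entity recognition can be run on survivor answers alone, without interviewer prompts mixed in.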

## Full-text entities

- **Diseases:** trauma (MESH:D014947)

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/PMC11927712/full.md

---
Source: https://tomesphere.com/paper/PMC11927712