CORHOH: Text corpus of holocaust oral histories
Daban Q. Jaff

TL;DR
The paper introduces CORHOH, a corpus of 500 Holocaust survivor oral histories with annotations for use in trauma and NLP research.
Contribution
The novel contribution is the creation of a structured, annotated corpus of Holocaust oral histories with rich metadata and TEI compliance.
Findings
The corpus includes 500 oral histories with 106,519 questions and 107,125 answers, each uniquely identified.
It supports trauma expression analysis and NLP techniques like sentiment analysis and named entity recognition.
The data is publicly available under a CC BY-NC-SA 4.0 license from the USHMM.
Abstract
This paper outlines the compilation and annotation process of CORHOH: Text CORpus of Holocaust Oral Histories. The corpus consists of 500 oral histories, each a narrative from one survivor. The transcripts of the oral histories are retrieved from the Let Them Speak Project [1], then normalized and further annotated. The corpus offers rich metadata about both the testimony givers and the interviews. All technical content is removed, and a unique identifier is assigned to each question (posed by the interviewer) and each answer (provided by the survivor). The corpus complies with the TEI guidelines [2]. The corpus includes 106,519 questions and 107,125 answers, making it easy to distinguish the utterances of the Holocaust survivor from those of anyone else involved in the interview, primarily the interviewer. CORHOH is particularly suited for studies on trauma…
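The separation of uniquely identified questions and answers described in the abstract can be illustrated with a minimal sketch. It assumes, hypothetically, that each utterance is encoded as a TEI `<u>` element with an `xml:id` and a `who` attribute; the actual CORHOH markup, the identifier scheme (`Q1`, `A1`), the speaker labels (`#interviewer`, `#survivor`), and the sample text are all invented placeholders and are not taken from the corpus.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Invented placeholder transcript; the element layout, ids, and speaker
# labels are assumptions, not the corpus's documented schema.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <u xml:id="Q1" who="#interviewer">Where were you born?</u>
      <u xml:id="A1" who="#survivor">I was born in a small town.</u>
      <u xml:id="Q2" who="#interviewer">Do you remember your school?</u>
      <u xml:id="A2" who="#survivor">Yes, very well.</u>
    </body>
  </text>
</TEI>"""


def split_utterances(tei_xml):
    """Return (questions, answers) as dicts mapping utterance id to text."""
    root = ET.fromstring(tei_xml)
    questions, answers = {}, {}
    for u in root.iter(f"{{{TEI_NS}}}u"):
        uid = u.get(f"{{{XML_NS}}}id")          # the xml:id attribute
        text = "".join(u.itertext()).strip()    # flatten any nested markup
        if u.get("who") == "#survivor":
            answers[uid] = text
        else:
            # Anyone other than the survivor, primarily the interviewer.
            questions[uid] = text
    return questions, answers


questions, answers = split_utterances(SAMPLE)
print(len(questions), "questions;", len(answers), "answers")
```

Keying each utterance by its identifier keeps survivor speech separable from interviewer speech, which is the property the abstract highlights for trauma-focused and NLP analyses.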
Taxonomy
Topics: Natural Language Processing Techniques · Handwritten Text Recognition Techniques · Topic Modeling
