# Reliable Access to Massive Restricted Texts: Experience-based Evaluation

**Authors:** Zong Peng, Beth Plale

arXiv: 1903.00771 · 2019-03-05

## TL;DR

This paper evaluates candidate storage systems for massive restricted digital text corpora, using the HathiTrust digital library as a case study, and selects Cassandra 3.x as the primary back-end store.

## Contribution

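The paper identifies the requirements of managing a massive restricted text corpus for computational analysis and uses them as the basis for an empirical evaluation of candidate storage systems, leading to a deployment decision at the HathiTrust Research Center.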

## Key findings

- Cassandra 3.x was chosen as the primary back-end store.
- The requirements identified for managing a massive text corpus serve as the basis for evaluating candidate storage solutions.
- The evaluation is based on the 5.9 billion page HathiTrust collection, and the Cassandra 3.x store is in deployment at the HathiTrust Research Center.
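
To make "reliable access under failures" at page granularity concrete, here is a minimal sketch in Python. It is a toy in-memory model and not the authors' system or Cassandra's API: the class `ToyPageStore`, the `(volume_id, page_seq)` key, the node names, and the replication factor of 3 are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical storage node holding page text keyed by (volume_id, page_seq)."""
    name: str
    up: bool = True
    pages: dict = field(default_factory=dict)


class ToyPageStore:
    """Toy replicated store: each page is copied to `replication_factor` nodes."""

    def __init__(self, node_names, replication_factor=3):
        if replication_factor > len(node_names):
            raise ValueError("replication factor exceeds number of nodes")
        self.nodes = [Node(name) for name in node_names]
        self.replication_factor = replication_factor

    def _replicas(self, volume_id, page_seq):
        # Hash the page key to pick a starting node, then take the next
        # nodes in ring order as additional replicas.
        key = f"{volume_id}:{page_seq}".encode()
        start = int(hashlib.md5(key).hexdigest(), 16) % len(self.nodes)
        return [
            self.nodes[(start + i) % len(self.nodes)]
            for i in range(self.replication_factor)
        ]

    def put_page(self, volume_id, page_seq, text):
        # Simplification: assume every replica is reachable at write time.
        for node in self._replicas(volume_id, page_seq):
            node.pages[(volume_id, page_seq)] = text

    def get_page(self, volume_id, page_seq):
        # Return the page from the first replica that is up.
        for node in self._replicas(volume_id, page_seq):
            if node.up and (volume_id, page_seq) in node.pages:
                return node.pages[(volume_id, page_seq)]
        raise LookupError("no live replica holds this page")
```

In this sketch, a single page read still succeeds after one replica is marked down (`node.up = False`) and fails only when every replica of that page is unavailable. A production store would also need to handle writes during failures and access control, which are omitted here.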

## Abstract

Libraries are seeing growing numbers of digitized textual corpora that frequently come with restrictions on their content. Large corpora for computational analysis, while of interest to scholars, can be cumbersome because of the combination of size, granularity of access, and access restrictions. Efficient management of such a collection for general access, especially under failures, depends on the primary storage system. In this paper, we identify the requirements of managing a massive text corpus for computational analysis and use them as the basis to evaluate candidate storage solutions. The study is based on the 5.9 billion page collection of the HathiTrust digital library. Our findings led to the choice of Cassandra 3.x for the primary back-end store, which is currently in deployment in the HathiTrust Research Center.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1903.00771/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/1903.00771/full.md

---
Source: https://tomesphere.com/paper/1903.00771