Data Lakes for Digital Humanities

Jérôme Darmont (ERIC); Cécile Favre (ERIC); Sabine Loudcher (ERIC); Camille Noûs

arXiv:2012.02454·cs.DB·December 7, 2020

TL;DR

This paper advocates data lakes as a way to manage the varied data formats found in Digital Humanities and to counter data siloing. It reports on ongoing projects and the lessons learned from running them.

Contribution

It proposes applying data lakes in Digital Humanities and describes ongoing projects that illustrate how they can handle heterogeneous data sources.

Findings

01. Data lakes help integrate diverse humanities data formats.

02. Collaborative projects reveal practical benefits and challenges.

03. Lessons learned inform future Digital Humanities data management.

Abstract

Traditional data in Digital Humanities projects bear various formats (structured, semi-structured, textual) and need substantial transformations (encoding and tagging, stemming, lemmatization, etc.) to be managed and analyzed. To fully master this process, we propose the use of data lakes as a solution to data siloing and big data variety problems. We describe data lake projects we currently run in close collaboration with researchers in humanities and social sciences and discuss the lessons learned running these projects.
