The Anatomy of a Search and Mining System for Digital Archives

Martyn Harris; Mark Levene; Dell Zhang; Dan Levene

arXiv:1603.07150 · cs.DL · March 24, 2016 · 1 citation

Open Access

TL;DR

Samtla is a digital humanities system that enables language-agnostic approximate phrase search and document comparison using a character-based n-gram model, supporting textual analysis and pattern discovery in large corpora.

Contribution

The paper introduces Samtla, a novel digital humanities tool employing character-based n-gram models and suffix trees for flexible, language-independent text retrieval and analysis.
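As a rough illustration of why a character-level suffix structure suits language-independent phrase lookup, the sketch below builds a naive sorted list of suffixes and finds every occurrence of a phrase by binary search. It is not Samtla's index: the paper describes a space-optimised suffix tree, whereas this stand-in stores every suffix in full and is only practical for tiny inputs. The class name `SuffixIndex` and its methods are hypothetical.

```python
from bisect import bisect_left


class SuffixIndex:
    """Naive sorted-suffix index (hypothetical stand-in for a suffix tree)."""

    def __init__(self, docs):
        # docs maps a document id to its raw text; no tokenisation is done,
        # so the index makes no assumption about word boundaries or language.
        self.entries = sorted(
            (text[i:], doc_id, i)
            for doc_id, text in docs.items()
            for i in range(len(text))
        )
        self.keys = [suffix for suffix, _, _ in self.entries]

    def find(self, phrase):
        """Return sorted (doc_id, offset) pairs where phrase occurs exactly."""
        # Suffixes that start with the phrase form one contiguous run in
        # sorted order, beginning at the first key that is >= phrase.
        pos = bisect_left(self.keys, phrase)
        hits = []
        while pos < len(self.keys) and self.keys[pos].startswith(phrase):
            _, doc_id, offset = self.entries[pos]
            hits.append((doc_id, offset))
            pos += 1
        return sorted(hits)


index = SuffixIndex({"d1": "banana", "d2": "bandana"})
print(index.find("ana"))
```

Because lookups operate on raw characters, the same code works for any script; a real suffix tree offers the same prefix-run property with far lower memory use.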

Findings

01. Effective language-agnostic search with high flexibility

02. Successful case studies demonstrating practical utility

03. Positive evaluation of ranking performance through crowdsourcing

Abstract

Samtla (Search And Mining Tools with Linguistic Analysis) is a digital humanities system designed in collaboration with historians and linguists to assist them in their research by quantifying the content of any textual corpus through approximate phrase search and document comparison. The retrieval engine uses a character-based n-gram language model, rather than the conventional word-based one, to achieve great flexibility in language-agnostic query processing. The index is implemented as a space-optimised character-based suffix tree with an accompanying database of document content and metadata. A number of text mining tools are integrated into the system to allow researchers to discover textual patterns, perform comparative analysis, and find out what is currently popular in the research community. Herein we describe the system architecture, user interface, models and…
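To make the idea of character-based approximate matching concrete, here is a minimal sketch of ranking documents by a smoothed query likelihood over character trigrams. It is a simplification under stated assumptions, not the model from the paper: each trigram is treated as an independent unit, document and collection probabilities are mixed with a fixed weight `lam`, and the collection estimate uses add-one smoothing. The names `char_ngrams`, `CharNgramRanker`, and the sample documents are all hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class CharNgramRanker:
    """Rank documents by smoothed likelihood of the query's character n-grams."""

    def __init__(self, docs, n=3, lam=0.5):
        self.n = n
        self.lam = lam  # assumed mixing weight between document and collection
        self.doc_counts = {
            doc_id: Counter(char_ngrams(text, n)) for doc_id, text in docs.items()
        }
        self.doc_totals = {
            doc_id: sum(counts.values()) for doc_id, counts in self.doc_counts.items()
        }
        self.coll_counts = Counter()
        for counts in self.doc_counts.values():
            self.coll_counts.update(counts)
        self.coll_total = sum(self.coll_counts.values())
        self.vocab = len(self.coll_counts)

    def score(self, query, doc_id):
        """Log-likelihood of the query n-grams under the mixed model."""
        counts = self.doc_counts[doc_id]
        total = self.doc_totals[doc_id]
        log_p = 0.0
        for gram in char_ngrams(query, self.n):
            p_doc = counts[gram] / total if total else 0.0
            # Add-one smoothing keeps unseen n-grams from zeroing the score.
            p_coll = (self.coll_counts[gram] + 1) / (self.coll_total + self.vocab + 1)
            log_p += math.log(self.lam * p_doc + (1 - self.lam) * p_coll)
        return log_p

    def rank(self, query):
        """Return document ids ordered from best to worst match."""
        return sorted(self.doc_counts, key=lambda d: self.score(query, d), reverse=True)


docs = {
    "a": "the king of babylon",
    "b": "a letter to the governor",
    "c": "king nebuchadnezzar of babylon",
}
ranker = CharNgramRanker(docs)
print(ranker.rank("king of babilon"))  # query contains a deliberate misspelling
```

Because a misspelled or variant query still shares most of its trigrams with the intended phrase, documents containing that phrase keep a high score without any tokenisation, stemming, or language-specific rules. Queries shorter than `n` characters produce no n-grams, so every document ties at a score of zero in this sketch.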

Taxonomy

Topics: Natural Language Processing Techniques · Web Data Mining and Analysis · Topic Modeling