ArcGPT: A Large Language Model Tailored for Real-world Archival Applications

Shitou Zhang; Jingrui Hou; Siyuan Peng; Zuchao Li; Qibiao Hu; Ping Wang

arXiv:2307.14852·cs.CL·July 28, 2023·2 cites

Open Access · 1 Repo · 1 Dataset

TL;DR

ArcGPT is a large language model designed for archival applications. It is pre-trained on archival-domain data and evaluated on a new benchmark, AMBLE, where the authors report that it outperforms existing state-of-the-art models.

Contribution

This paper introduces ArcGPT, which the authors describe as, to their knowledge, the first general-purpose LLM tailored to the archival field, along with AMBLE, a benchmark comprising four real-world archival tasks.

Findings

01. ArcGPT outperforms existing models on archival tasks.

02. Pre-training on archival data improves model effectiveness.

03. The AMBLE benchmark facilitates future archival LLM research.

Abstract

Archives play a crucial role in preserving information and knowledge, and the exponential growth of such data necessitates efficient and automated tools for managing and utilizing archive information resources. Archival applications involve managing massive data that are challenging to process and analyze. Although LLMs have made remarkable progress in diverse domains, there is no publicly available LLM tailored to archives. Addressing this gap, we introduce ArcGPT, to our knowledge the first general-purpose LLM tailored to the archival field. To enhance model performance on real-world archival tasks, ArcGPT has been pre-trained on massive and extensive archival domain data. Alongside ArcGPT, we release AMBLE, a benchmark comprising four real-world archival tasks. Evaluation on AMBLE shows that ArcGPT outperforms existing state-of-the-art models, marking a substantial step forward in…

Code & Models

Repositories

stzhang-patrick/arcmmlu
pytorch

Datasets

patrickshitou/ArcMMLU
dataset · 23 downloads

Taxonomy

Topics: Topic Modeling · Digital and Traditional Archives Management · Data Quality and Management