Arctic-Extract Technical Report

Mateusz Chiliński; Julita Ołtusek; Wojciech Jaśkowski

arXiv:2511.16470·cs.CL·November 21, 2025

Open Access

TL;DR

Arctic-Extract is a lightweight, state-of-the-art model for extracting structured data from business documents, capable of processing long documents efficiently on resource-limited hardware.

Contribution

The paper introduces Arctic-Extract, a resource-efficient model with strong performance for document understanding tasks, suitable for deployment on constrained devices.

Findings

01. Achieves high accuracy in extracting data from scanned and digital-born documents.

02. Can process up to 125 A4 pages on an A10 GPU with 24 GB of memory.

03. Maintains strong performance despite low resource requirements (the model weighs 6.6 GiB).

Abstract

Arctic-Extract is a state-of-the-art model designed for extracting structured data (question answering, entities, and tables) from scanned or digital-born business documents. Despite its SoTA capabilities, the model weighs only 6.6 GiB, making it deployable on resource-constrained hardware such as A10 GPUs with 24 GB of memory. Arctic-Extract can process up to 125 A4 pages on those GPUs, making it suitable for long-document processing. This paper presents Arctic-Extract's training protocols and evaluation results, demonstrating its strong performance in document understanding.
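
To put the abstract's figures in proportion, the following is a rough back-of-the-envelope sketch, not a measurement from the paper. It assumes the GPU's "24 GB" is decimal gigabytes and that all memory left after loading the 6.6 GiB of weights is available for the 125 pages; in practice the runtime, activations, and framework overhead also consume memory, so the per-page figure is only an upper bound. The function name `memory_headroom` is hypothetical.

```python
GIB = 2**30   # bytes in one gibibyte
GB = 10**9    # bytes in one (decimal) gigabyte; assumed unit for the GPU spec


def memory_headroom(gpu_mem_gb: float, weights_gib: float, pages: int):
    """Return (GiB left after loading weights, upper-bound MiB per page).

    Assumes every byte not used by the weights can be spent on pages,
    which overstates what is available in a real deployment.
    """
    total_gib = gpu_mem_gb * GB / GIB
    free_gib = total_gib - weights_gib
    per_page_mib = free_gib * 1024 / pages
    return free_gib, per_page_mib


# Figures taken from the abstract: 24 GB A10, 6.6 GiB model, 125 A4 pages.
free_gib, per_page_mib = memory_headroom(24, 6.6, 125)
print(f"{free_gib:.2f} GiB free, at most {per_page_mib:.0f} MiB per page")
```

Under these assumptions, roughly 15.75 GiB remains after the weights are loaded, which bounds the working memory per page at about 129 MiB when 125 pages are processed together.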

Taxonomy

Topics: Handwritten Text Recognition Techniques · Topic Modeling · Natural Language Processing Techniques