Arctic-Extract Technical Report
Mateusz Chiliński, Julita Ołtusek, Wojciech Jaśkowski

TL;DR
Arctic-Extract is a lightweight, state-of-the-art model for extracting structured data from business documents, capable of processing long documents efficiently on resource-limited hardware.
Contribution
The paper introduces Arctic-Extract, a resource-efficient model with strong performance for document understanding tasks, suitable for deployment on constrained devices.
Findings
Achieves high accuracy in extracting data from scanned and digital documents.
Can process up to 125 A4 pages on an A10 GPU with 24 GB of memory.
Maintains strong performance despite low resource requirements.
Abstract
Arctic-Extract is a state-of-the-art model designed for extracting structured data (question answering, entities and tables) from scanned or digital-born business documents. Despite its SoTA capabilities, the model weighs only 6.6 GiB, making it deployable on resource-constrained hardware such as A10 GPUs with 24 GB of memory. Arctic-Extract can process up to 125 A4 pages on those GPUs, making it suitable for long-document processing. This paper highlights Arctic-Extract's training protocols and evaluation results, demonstrating its strong performance in document understanding.
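The abstract quotes the model size in GiB and the GPU memory in GB, so the headroom is easier to see once both are in the same unit. The following is a minimal back-of-the-envelope sketch, not a figure from the report: it assumes the 24 GB is decimal gigabytes and that all memory not taken by the weights is available for document processing, which ignores framework overhead. The function names are hypothetical, and the per-page number is only an upper bound implied by the stated totals.

```python
GIB = 2**30   # bytes in one gibibyte (binary prefix)
GB = 10**9    # bytes in one gigabyte (decimal prefix)


def memory_headroom_gb(weights_gib: float, device_gb: float) -> float:
    """Device memory left after loading the weights, in decimal GB.

    Assumption: device_gb is a decimal-GB figure and nothing else
    (CUDA context, allocator overhead) occupies the device.
    """
    return (device_gb * GB - weights_gib * GIB) / GB


def per_page_budget_mb(headroom_gb: float, pages: int) -> float:
    """Upper bound on memory available per page, in decimal MB."""
    return headroom_gb * GB / pages / 10**6


# Values taken from the abstract: 6.6 GiB weights, 24 GB A10, 125 pages.
headroom = memory_headroom_gb(weights_gib=6.6, device_gb=24.0)
budget = per_page_budget_mb(headroom, pages=125)
print(f"headroom: {headroom:.2f} GB, per-page budget: {budget:.1f} MB")
```

Under these assumptions the weights leave roughly 17 GB free, or on the order of 135 MB per page at the 125-page limit. The report does not say how that memory is actually split between activations, caches, and other buffers.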
Taxonomy
Topics: Handwritten Text Recognition Techniques · Topic Modeling · Natural Language Processing Techniques
