Delivering Document Conversion as a Cloud Service with High Throughput and Responsiveness

Christoph Auer (1), Michele Dolfi (1), André Carvalho (2), Cesar Berrospi Ramis (1), Peter W. J. Staar (1) ((1) IBM Research, (2) SoftINSA, Lda.)

arXiv:2206.00785 · cs.DL · July 14, 2022

TL;DR

This paper presents a scalable cloud-based document conversion service capable of processing over one million PDF pages per hour by optimizing workload distribution and resource management.

Contribution

It introduces a novel scalable architecture for document conversion in the cloud, addressing challenges of high throughput and responsiveness for complex, variable document formats.

Findings

01. Achieved over one million pages per hour throughput
02. Compared two workload distribution strategies and configurations
03. Demonstrated high resource efficiency and scalability

Abstract

Document understanding is a key business process in the data-driven economy since documents are central to knowledge discovery and business insights. Converting documents into a machine-processable format is a particular challenge here due to their huge variability in formats and complex structure. Accordingly, many algorithms and machine-learning methods emerged to solve particular tasks such as Optical Character Recognition (OCR), layout analysis, table-structure recovery, figure understanding, etc. We observe the adoption of such methods in document understanding solutions offered by all major cloud providers. Yet, publications outlining how such services are designed and optimized to scale in the cloud are scarce. In this paper, we focus on the case of document conversion to illustrate the particular challenges of scaling a complex data processing pipeline with a strong reliance on…

Code & Models

Repositories

ds4sd/deepsearch-toolkit
