Robustness of Structured Data Extraction from In-plane Rotated Documents using Multi-Modal Large Language Models (LLM)

Anjanava Biswas; Wrick Talukdar

arXiv:2406.10295·cs.CL·June 18, 2024·5 cites

Open Access

TL;DR

This paper examines how in-plane document rotation affects data extraction accuracy of multi-modal LLMs, identifies safe rotation angles, and proposes methods to improve robustness against skew in real-world document processing.

Contribution

It provides a comprehensive analysis of skew impact on multi-modal LLMs and introduces new approaches to enhance their robustness to document rotation.

Findings

01 Skew significantly reduces extraction accuracy across models

02 Safe in-plane rotation angles vary per model

03 Proposed architectures improve skew robustness

Abstract

Multi-modal large language models (LLMs) have shown remarkable performance in various natural language processing tasks, including data extraction from documents. However, the accuracy of these models can be significantly affected by document in-plane rotation, also known as skew, a common issue in real-world scenarios for scanned documents. This study investigates the impact of document skew on the data extraction accuracy of three state-of-the-art multi-modal LLMs: Anthropic Claude V3 Sonnet, GPT-4-Turbo, and Llava:v1.6. We focus on extracting specific entities from synthetically generated sample documents with varying degrees of skewness. The results demonstrate that document skew adversely affects the data extraction accuracy of all the tested LLMs, with the severity of the impact varying across models. We identify the safe in-plane rotation angles (SIPRA) for each model and…
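The idea of a safe in-plane rotation angle range can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract as shown does not give the paper's exact SIPRA definition, so the function `safe_rotation_angles`, the accuracy threshold, and the toy accuracy values are hypothetical and are not taken from the paper's results. The sketch treats the safe range as the widest contiguous run of tested angles around 0 degrees whose extraction accuracy stays at or above a chosen threshold.

```python
def safe_rotation_angles(accuracy_by_angle, threshold):
    """Hypothetical helper: widest contiguous angle range around 0 degrees
    whose accuracy is >= threshold. Returns (min_angle, max_angle) or None.

    accuracy_by_angle maps a tested skew angle (degrees) to an accuracy in [0, 1].
    """
    # The unrotated document must itself meet the threshold.
    if 0 not in accuracy_by_angle or accuracy_by_angle[0] < threshold:
        return None

    angles = sorted(accuracy_by_angle)
    zero = angles.index(0)

    # Walk left from 0 while accuracy stays acceptable.
    low = zero
    while low > 0 and accuracy_by_angle[angles[low - 1]] >= threshold:
        low -= 1

    # Walk right from 0 while accuracy stays acceptable.
    high = zero
    while high < len(angles) - 1 and accuracy_by_angle[angles[high + 1]] >= threshold:
        high += 1

    return angles[low], angles[high]


# Toy values invented for this example only; not measurements from the paper.
toy_accuracy = {-10: 0.40, -5: 0.92, 0: 0.98, 5: 0.95, 10: 0.91, 15: 0.55}
print(safe_rotation_angles(toy_accuracy, threshold=0.90))  # (-5, 10)
```

Under this assumed definition, each model would get its own range from its own accuracy-versus-angle measurements, which is consistent with the statement that the severity of the impact varies across models.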

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Focus