HySem: A context length optimized LLM pipeline for unstructured tabular extraction

Narayanan PP; Anantharaman Palacode Narayana Iyer

arXiv:2408.09434 · cs.CL · October 8, 2024

Open Access

TL;DR

HySem is a novel pipeline that optimizes context length for large language models to accurately extract and semantically represent unstructured tabular data from HTML tables in the pharmaceutical industry.

Contribution

It introduces a context length optimization technique and a custom fine-tuned model tailored for small to medium pharmaceutical enterprises, improving accuracy and handling larger tables efficiently.

Findings

01. HySem outperforms peer open-source models in accuracy.

02. It provides competitive performance against GPT-4o.

03. It addresses context length limitations effectively.

Abstract

Regulatory compliance reporting in the pharmaceutical industry relies on detailed tables, but these are often under-utilized beyond compliance due to their unstructured format and arbitrary content. Extracting and semantically representing tabular data is challenging due to diverse table presentations. Large Language Models (LLMs) demonstrate substantial potential for semantic representation, yet they encounter challenges related to accuracy and context size limitations, which are crucial considerations for industry applications. We introduce HySem, a pipeline that employs a novel context length optimization technique to generate accurate semantic JSON representations from HTML tables. This approach utilizes a custom fine-tuned model specifically designed for cost- and privacy-sensitive small and medium pharmaceutical enterprises. Running on commodity hardware and leveraging…
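To make the task concrete, here is a minimal sketch of turning an HTML table into a compact JSON representation. It is not HySem's method: the abstract does not spell out the optimization technique or the fine-tuned model, so the sketch only illustrates why HTML markup is costly for a limited context window. The class `TableParser`, the function `table_to_records`, and the sample table are all hypothetical, and character counts are used as a crude stand-in for token counts.

```python
import json
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cell text of a simple HTML table, row by row (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        # Attributes (class, style, ...) are ignored: they carry no table content.
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join fragments and normalize internal whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_records(html):
    """Treat the first row as the header and map each later row to a dict."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Illustrative input only; not data from the paper.
html = """
<table class="report" style="width:100%">
  <tr><th>Batch</th><th>Assay (%)</th></tr>
  <tr><td style="text-align:right">B-001</td><td>99.2</td></tr>
  <tr><td style="text-align:right">B-002</td><td>98.7</td></tr>
</table>
"""
records = table_to_records(html)
compact = json.dumps(records, separators=(",", ":"))
print(compact)
print(len(html), "characters of HTML ->", len(compact), "characters of JSON")
```

A rule-based parser like this only handles regular tables with a single header row. The tables the abstract describes have diverse presentations and arbitrary content, which is the case where an LLM-based approach is motivated.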

Taxonomy

Topics: Natural Language Processing Techniques