HySem: A context length optimized LLM pipeline for unstructured tabular extraction
Narayanan PP, Anantharaman Palacode Narayana Iyer

TL;DR
HySem is a novel pipeline that optimizes context length for large language models to accurately extract and semantically represent unstructured tabular data from HTML tables in the pharmaceutical industry.
Contribution
It introduces a context length optimization technique and a custom fine-tuned model tailored for small to medium pharmaceutical enterprises, improving accuracy and handling larger tables efficiently.
Findings
HySem outperforms peer open-source models in accuracy.
It provides competitive performance against GPT-4o.
It addresses context length limitations effectively.
Abstract
Regulatory compliance reporting in the pharmaceutical industry relies on detailed tables, but these are often under-utilized beyond compliance because of their unstructured format and arbitrary content. Extracting and semantically representing tabular data is challenging because tables are presented in many different ways. Large Language Models (LLMs) demonstrate substantial potential for semantic representation, yet they face challenges with accuracy and context size limitations, both of which are crucial considerations for industry applications. We introduce HySem, a pipeline that employs a novel context length optimization technique to generate accurate semantic JSON representations from HTML tables. This approach utilizes a custom fine-tuned model specifically designed for cost- and privacy-sensitive small and medium pharmaceutical enterprises. Running on commodity hardware and leveraging…
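To make the target of the task concrete, the sketch below shows the kind of HTML-table-to-JSON mapping the abstract refers to. It is not HySem's method: the paper uses a fine-tuned LLM, whereas this is a rule-based illustration that only handles a regular table with a single header row. The class `TableParser`, the function `table_to_records`, and the sample table are all invented for this example.

```python
import json
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cell text of one simple HTML table, row by row (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join fragments and collapse internal whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_records(html):
    """Map a regular table to a list of {header: value} dicts.

    Assumes the first row is the header and there are no merged cells;
    irregular tables are exactly the hard case the paper targets.
    """
    parser = TableParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Invented sample data, for illustration only.
html = """
<table>
  <tr><th>Batch</th><th>Assay (%)</th></tr>
  <tr><td>A-001</td><td>99.2</td></tr>
  <tr><td>A-002</td><td>98.7</td></tr>
</table>
"""
records = table_to_records(html)
print(json.dumps(records, indent=2))
```

A fixed rule like this breaks down on merged cells, multi-level headers, or free-form layouts, which is the gap the abstract says LLM-based semantic representation is meant to fill.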
Taxonomy
Topics: Natural Language Processing Techniques
