Information Extraction of Nested Complex Structure of Quantum Cascade Lasers via Large Language Models
Xiao Fang, Ming Lü, Hanwen Liang, Xingshen Song, Kele Xu, Hui Cai, Chaofan Zhang

TL;DR
This paper introduces JSG-IE, a JSON-schema guided method that enhances the extraction of complex, nested device data from literature using large language models, significantly improving accuracy without fine-tuning.
Contribution
The authors propose a schema-constrained generation approach for LLMs to reliably extract deeply structured data, outperforming conventional prompting methods across multiple models.
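As a rough illustration of what schema-guided extraction can look like in practice, the sketch below embeds a JSON schema in the prompt and then checks the model's reply against that schema. It is not the authors' pipeline. The schema fields (material_system, periods, layers, thickness_nm, role), the prompt wording, and the helper names are hypothetical, and the validator covers only a small subset of JSON Schema (type, required, properties, items).

```python
import json

# Hypothetical schema for a nested layer sequence; field names are
# illustrative assumptions, not taken from the paper.
QCL_SCHEMA = {
    "type": "object",
    "required": ["material_system", "periods", "layers"],
    "properties": {
        "material_system": {"type": "string"},
        "periods": {"type": "integer"},
        "layers": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["material", "thickness_nm", "role"],
                "properties": {
                    "material": {"type": "string"},
                    "thickness_nm": {"type": "number"},
                    "role": {"type": "string"},
                },
            },
        },
    },
}

_TYPES = {"object": dict, "array": list, "string": str,
          "integer": int, "number": (int, float)}


def validate(value, schema, path="$"):
    """Return a list of error strings (subset of JSON Schema only)."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, _TYPES[schema["type"]]):
        return [f"{path}: expected {schema['type']}"]
    errors = []
    if schema["type"] == "object":
        for key in schema.get("required", []):
            if key not in value:
                errors.append(f"{path}.{key}: missing required field")
        for key, sub in schema.get("properties", {}).items():
            if key in value:
                errors.extend(validate(value[key], sub, f"{path}.{key}"))
    elif schema["type"] == "array":
        for i, item in enumerate(value):
            errors.extend(validate(item, schema["items"], f"{path}[{i}]"))
    return errors


def build_prompt(paper_text):
    """Embed the schema in the prompt (the wording is an assumption)."""
    return ("Extract the device structure from the text below. Reply with "
            "JSON only, conforming to this JSON schema:\n"
            + json.dumps(QCL_SCHEMA, indent=2)
            + "\n\nText:\n" + paper_text)


def parse_reply(reply):
    """Parse a model reply; return (data or None, list of errors)."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError as exc:
        return None, [f"$: invalid JSON ({exc.msg})"]
    errors = validate(data, QCL_SCHEMA)
    return (data if not errors else None), errors
```

The error list returned by parse_reply could be fed back to the model in a retry prompt. That retry loop is a design choice of this sketch and is not described in the summary above.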
Findings
Performance improved by 5.7% with schema guidance over conventional prompts.
The Kimi-k2-thinking model achieved the highest F1 score, 83.4%.
Mid-tier and open-source models saw up to 24.1% F1 score gains.
Abstract
The rapid advancement of Large Language Models has transformed scientific research workflows, including enabling the automated extraction of data directly from published literature. Most existing efforts, however, focus on extracting simple labeled key-value entities, whereas many scientific applications require more complex, hierarchically structured data. A representative example is Quantum Cascade Lasers, whose device architectures are defined by tens of interdependent parameters organized in nested layer sequences. In this work we propose a JSON-Schema Guided Information Extraction Pipeline (JSG-IE) that enables reliable extraction of deeply structured device data without model fine-tuning. By transforming extraction into a schema-constrained generation task, our approach significantly improves structural consistency and accuracy. Across 12 state-of-the-art LLMs, a properly…
