Deep reflective reasoning in interdependence constrained structured data extraction from clinical notes for digital health

Jingwei Huang; Kuroush Nezafati; Zhikai Chi; Ruichen Rong; Colin Treager; Tingyi Wanyan; Yueshuang Xu; Xiaowei Zhan; Patrick Leavey; Guanghua Xiao; Wenqi Shi; Yang Xie

arXiv:2603.20435·cs.AI·April 21, 2026

TL;DR

This paper introduces deep reflective reasoning, an LLM agent framework that iteratively checks and revises structured data extracted from clinical notes, improving accuracy and consistency in three oncology applications.

Contribution

It presents a new LLM-based method that enforces interdependence constraints through iterative self-critique, enhancing reliability of clinical data extraction.

Findings

01

In colorectal cancer synoptic reporting (n=217), average F1 across eight categorical synoptic variables increased from 0.828 to 0.911.

02

Accuracy for immunostaining pattern improved from 0.870 to 0.927.

03

Tumor staging accuracy increased from 0.680 to 0.833.

Abstract

Extracting structured information from clinical notes requires navigating a dense web of interdependent variables where the value of one attribute logically constrains others. Existing Large Language Model (LLM)-based extraction pipelines often struggle to capture these dependencies, leading to clinically inconsistent outputs. We propose deep reflective reasoning, an LLM agent framework that iteratively self-critiques and revises structured outputs by checking consistency among variables, the input text, and retrieved domain knowledge, stopping when outputs converge. We extensively evaluate the proposed method in three diverse oncology applications: (1) on colorectal cancer synoptic reporting from gross descriptions (n=217), reflective reasoning improved average F1 across eight categorical synoptic variables from 0.828 to 0.911 and increased mean correct rate across four…
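The critique-and-revise loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`reflective_extract`, `toy_critique`, `toy_revise`), the `max_rounds` cap, and the single node-count/N-stage rule are all hypothetical stand-ins. In the actual framework, the critic and reviser would presumably be LLM calls that also consult the note text and retrieved domain knowledge.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Critic = Callable[[Record], List[str]]
Reviser = Callable[[Record, List[str]], Record]


def reflective_extract(initial: Record, critique: Critic, revise: Reviser,
                       max_rounds: int = 5) -> Record:
    """Repeat critique -> revise until no issues remain or the output stops changing.

    `max_rounds` is an assumed safety cap, not something stated in the paper.
    """
    current = dict(initial)
    for _ in range(max_rounds):
        issues = critique(current)
        if not issues:
            break  # record is consistent with every check
        revised = revise(current, issues)
        if revised == current:
            break  # converged: the revision changed nothing
        current = revised
    return current


def toy_critique(record: Record) -> List[str]:
    """Hypothetical rule-based critic: zero positive nodes implies stage N0."""
    issues = []
    if record.get("positive_nodes") == "0" and record.get("n_stage") != "N0":
        issues.append("n_stage must be N0 when positive_nodes is 0")
    return issues


def toy_revise(record: Record, issues: List[str]) -> Record:
    """Hypothetical reviser: repairs the one inconsistency the toy critic reports."""
    fixed = dict(record)
    if issues:
        fixed["n_stage"] = "N0"
    return fixed


draft = {"positive_nodes": "0", "n_stage": "N1"}
final = reflective_extract(draft, toy_critique, toy_revise)
print(final)
```

Passing the critic and reviser as callables keeps the loop independent of any particular model or rule set, so the same skeleton works whether the checks are hand-written rules, as here, or model-generated critiques.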
