DT-BEHRT: Disease Trajectory-aware Transformer for Interpretable Patient Representation Learning
Deyi Li, Zijun Yao, Qi Xu, Muxuan Liang, Lingyao Li, Zijian Xu, Mei Liu

TL;DR
DT-BEHRT introduces a novel graph-enhanced transformer model for EHR data that explicitly models disease trajectories and provides interpretable patient representations, improving predictive accuracy and clinical relevance.
Contribution
The paper presents a new disease trajectory-aware transformer architecture with a tailored pre-training method for better interpretability and performance in EHR-based patient modeling.
Findings
Achieves strong predictive performance on the MIMIC-III and MIMIC-IV benchmarks across multiple tasks.
Provides interpretable patient representations aligned with clinical reasoning.
Demonstrates robustness through a novel pre-training strategy.
Abstract
The growing adoption of electronic health record (EHR) systems has provided unprecedented opportunities for predictive modeling to guide clinical decision making. Structured EHRs contain longitudinal observations of patients across hospital visits, where each visit is represented by a set of medical codes. While sequence-based, graph-based, and graph-enhanced sequence approaches have been developed to capture rich code interactions over time or within the same visits, they often overlook the inherent heterogeneous roles of medical codes arising from distinct clinical characteristics and contexts. To this end, in this study we propose the Disease Trajectory-aware Transformer for EHR (DT-BEHRT), a graph-enhanced sequential architecture that disentangles disease trajectories by explicitly modeling diagnosis-centric interactions within organ systems and capturing asynchronous progression…
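The data layout the abstract describes (a patient as an ordered series of visits, each visit a set of medical codes) can be sketched as follows. This is a minimal illustration, not the authors' data pipeline: the names `Visit`, `Patient`, and `flatten` are hypothetical, and the ICD-9 codes are example values only.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Visit:
    # A visit is an unordered set of medical codes (hypothetical container).
    codes: FrozenSet[str]


@dataclass
class Patient:
    patient_id: str
    # Visits are kept in chronological order.
    visits: List[Visit] = field(default_factory=list)


def flatten(patient: Patient) -> List[Tuple[str, int]]:
    """Turn a patient into (code, visit_index) pairs.

    Codes are sorted within a visit only to make the output deterministic;
    the visit itself carries no internal order.
    """
    return [
        (code, idx)
        for idx, visit in enumerate(patient.visits)
        for code in sorted(visit.codes)
    ]


# Example values only: two visits with overlapping ICD-9 diagnosis codes.
patient = Patient(
    "p001",
    [
        Visit(frozenset({"428.0", "401.9"})),
        Visit(frozenset({"428.0", "584.9"})),
    ],
)
tokens = flatten(patient)
```

A sequence model would typically consume such pairs as code tokens plus a visit-position signal; how DT-BEHRT encodes them is not specified in this summary.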
Peer Reviews
Decision·Submitted to ICLR 2026
S1. The paper conducts experiments on MIMICs across multiple tasks. Beyond quantitative metrics, the authors include patient-level case studies that qualitatively analyze model explanations.
S2. Each module in DT-BEHRT, i.e., sequence, aggregation, progression, and patient representation, is clearly motivated from a clinician’s perspective, reflecting real-world medical reasoning.
S3. The introduction of two pretraining tasks allows the model to fully leverage EHR data across visits and dis
W1. The paper integrates design patterns from both Transformer-based and graph-based models, resulting in an architecture that appears more incremental. The combination of SR–DA–PR resembles conventional Transformer stacks, with the main variation being the use of ancestor node embeddings and customized losses. Similarly, the SR–DP–PR path largely parallels prior graph-based pipelines that model interactions between disease, visit, and patient nodes.
W2. The proposed Global Code Masking and An
1. Demonstrated Modular Contribution: The framework consists of multiple modules (DA, DP) and a new pre-training task (ACP). A key strength is the use of an Ablation Study (Table 3 in the paper) to clearly demonstrate how much each component contributes to the model's performance improvement.
2. Clinical Interpretability: The model doesn't just aim for higher performance; it attempts to link the rationales for its predictions to clinical reasoning (e.g., problems in a specific organ system, tem
1. Limited Dataset Validation: The datasets used for the experiment are limited to MIMIC-III and MIMIC-IV, which is insufficient to prove the model's generalization performance. EHR data has inherent biases depending on the hospital system, country, and ethnicity. Therefore, external validation on other large-scale ICU datasets (e.g., eICU, HiRID, UMCdb) is essential.
2. Lack of Prediction Task Diversity: The variety of prediction tasks performed is insufficient to claim that the proposed frame
* **Clinically-Aligned Architecture:** The model's DA and DP modules are designed to mirror clinical reasoning, enhancing interpretability.
* **Novel Pre-training:** The Ancestor Code Prediction (ACP) task effectively aligns the model's different modules with ontology information.
* **Strong Empirical Results:** DT-BEHRT outperforms baselines, especially on complex phenotyping and readmission tasks.
* **Targeted Ablation:** Ablation studies demonstrate the distinct contributions of the DA and DP
* **Ontology Dependence:** The Disease Aggregation (DA) module is explicitly tied to the ICD-9 ontology, which may not be adaptable.
* **Fixed Aggregation Threshold:** DA tokens are activated by a fixed hyperparameter $k$, and the impact of this choice isn't explored.
* **Incomplete Pre-train Ablation:** The ablation study does not isolate the effect of the DA token decorrelation loss ($l_{cov}$).
* **Simplistic Code Roles:** The model simplifies code roles, treating diagnoses as interactive whi
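
To make the ontology and threshold points in the reviews concrete, here is a rough sketch of how ICD-9 diagnosis codes map to ancestors and how a fixed threshold $k$ could gate organ-system tokens. It rests on assumptions: the reviews do not say how $k$ is applied, so the sketch assumes a system is activated when at least `k` diagnosis codes fall in it. The chapter table is a small subset of the ICD-9-CM chapter ranges, and `category`, `chapter`, and `active_systems` are hypothetical helper names, not the paper's API.

```python
from collections import Counter
from typing import Iterable, Optional, Set

# Subset of ICD-9-CM chapters, keyed by numeric 3-digit category range.
ICD9_CHAPTERS = [
    (390, 459, "circulatory"),
    (460, 519, "respiratory"),
    (520, 579, "digestive"),
    (580, 629, "genitourinary"),
]


def category(code: str) -> str:
    """Ancestor of an ICD-9 diagnosis code: the part before the decimal point."""
    return code.split(".")[0]


def chapter(code: str) -> Optional[str]:
    """Organ-system chapter for a code, or None if outside the subset above."""
    cat = category(code)
    if not cat.isdigit():
        return None  # V and E codes are not handled in this sketch
    number = int(cat)
    for low, high, name in ICD9_CHAPTERS:
        if low <= number <= high:
            return name
    return None


def active_systems(codes: Iterable[str], k: int) -> Set[str]:
    """ASSUMPTION: a system is 'active' when at least k codes map to it."""
    counts = Counter(ch for ch in map(chapter, codes) if ch is not None)
    return {name for name, n in counts.items() if n >= k}


codes = ["428.0", "401.9", "584.9"]  # example values only
systems_k2 = active_systems(codes, k=2)
systems_k1 = active_systems(codes, k=1)
```

With these example codes, raising `k` from 1 to 2 drops the genitourinary system, which illustrates why a reviewer would want the sensitivity to $k$ examined.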
Taxonomy
Topics: Machine Learning in Healthcare · Topic Modeling · Electronic Health Records Systems
