# Development and validation of a distributed representation model of Japanese high-dimensional administrative claims data for clinical epidemiology studies

**Authors:** Hiroki Matsui, Kiyohide Fushimi, Hideo Yasunaga

PMC · DOI: 10.1186/s12874-025-02549-7 · BMC Medical Research Methodology · 2025-04-11

## TL;DR

This paper develops and validates a method that compresses high-dimensional Japanese administrative claims data into word2vec distributed representations and uses them to adjust for unmeasured confounders in comparative effectiveness studies.

## Contribution

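The approach applies word2vec to all medical codes in the Japanese Diagnosis Procedure Combination (DPC) database, then uses the sum of the distributed representation weights of the claims recorded on the day of hospitalisation as risk-adjustment covariates in observational studies.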

## Key findings

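- In a simulation study of patients with heart failure, adjusting for distributed representations (Model 3) reduced the impact of unmeasured confounders and achieved better covariate balance than no adjustment (Model 1).
- Combining previously reported confounders with distributed representations (Model 4) showed no increase in bias compared with the true model (Model 2).
- In a re-evaluation of a previous study on early rehabilitation in patients with heart failure, Models 3 and 4 gave similar results.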

## Abstract

Unmeasured confounders pose challenges when observational data are analysed in comparative effectiveness studies. Integrating high-dimensional administrative claims data may help adjust for unmeasured confounders. We determined whether distributed representations can compress high-dimensional administrative claims data to adjust for unmeasured confounders.

Using the Japanese Diagnosis Procedure Combination (DPC) database from 1291 hospitals (between April 2018 and March 2020), we applied the word2vec algorithm to create distributed representations for all medical codes. We focused on patients with heart failure (HF) and simulated four risk-adjustment models: 1, no adjustment; 2, adjusting for previously reported confounders; 3, adjusting for the sum of distributed representation weights of administrative claims data on the day of hospitalisation (novel method); and 4, a combination of models 2 and 3. We re-evaluated a previous study on the effect of early rehabilitation in patients with HF and compared these risk-adjustment methods (models 1–4).
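The covariate construction described for Model 3 (summing the distributed representation weights of the codes recorded on the day of hospitalisation) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the code names, the three-dimensional vectors, and the `patient_vector` helper are all hypothetical, and in practice the embeddings would come from a word2vec model trained on claims codes rather than a hand-written dictionary.

```python
# Hypothetical toy embeddings: code names and values are made up for
# illustration; real vectors would be learned by word2vec from claims data.
EMBEDDINGS = {
    "DX_HEART_FAILURE": [0.2, -0.1, 0.4],
    "RX_LOOP_DIURETIC": [0.1, 0.3, -0.2],
    "PROC_ECHOCARDIOGRAM": [-0.3, 0.2, 0.1],
}


def patient_vector(codes, embeddings, dim):
    """Sum the embedding vectors of the codes recorded for one patient-day.

    Codes that are not in the embedding vocabulary are skipped.
    """
    total = [0.0] * dim
    for code in codes:
        vec = embeddings.get(code)
        if vec is None:
            continue  # unseen code: contributes nothing to the sum
        for i, value in enumerate(vec):
            total[i] += value
    return total


# Codes claimed on the (hypothetical) day of hospitalisation for one patient.
admission_day_codes = ["DX_HEART_FAILURE", "RX_LOOP_DIURETIC", "UNKNOWN_CODE"]
features = patient_vector(admission_day_codes, EMBEDDINGS, dim=3)
```

The resulting fixed-length `features` vector could then be entered as covariates in a propensity score or outcome model, regardless of how many codes a patient has on that day.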

Distributed representations were generated from the data of 15 998 963 in-patients, and 319 581 HF patients were identified. In the simulation study, Model 3 reduced the impact of unmeasured confounders and achieved better covariate balances than Model 1. Model 4 showed no increase in bias compared with the true model (Model 2) and was used as a reference model in the real-world application. When applied to a previous study, models 3 and 4 showed similar results.
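The abstract does not say which metric was used to judge covariate balance. One widely used choice is the standardized mean difference (SMD) between treatment groups, and the sketch below assumes that metric purely for illustration; the function name and the toy values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(treated, control):
    """Difference in group means divided by the pooled standard deviation.

    Uses the sample variance of each group; returns 0.0 when both groups
    have zero variance to avoid dividing by zero.
    """
    pooled_sd = sqrt((variance(treated) + variance(control)) / 2)
    if pooled_sd == 0:
        return 0.0
    return (mean(treated) - mean(control)) / pooled_sd


# Toy covariate values for two groups (made up, not from the paper).
treated_values = [1.0, 2.0, 3.0]
control_values = [2.0, 3.0, 4.0]
smd = standardized_mean_difference(treated_values, control_values)
```

An absolute SMD closer to zero after adjustment indicates that the covariate is more evenly distributed between groups.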

Distributed representation can compress detailed administrative claims data and adjust for unmeasured confounders in comparative effectiveness studies.

The online version contains supplementary material available at 10.1186/s12874-025-02549-7.

## Linked entities

- **Diseases:** heart failure (MONDO:0005252)

## Full-text entities

- **Diseases:** HF (MESH:D006333)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC11987422/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC11987422/full.md

## References

3 references — full list in the complete paper: https://tomesphere.com/paper/PMC11987422/full.md

---
Source: https://tomesphere.com/paper/PMC11987422