# Linkage of HIV treatment and population-based surveillance records in rural South Africa: the AHRI Unified Data Platform (AUDP)

**Authors:** Dickman Gareta, Evelyn Lauren, Khumbo Shumba, Cornelius Nattey, Matthew P. Fox, Koleka Mlisana, Matthias Egger, Dorina Onoya, Kobus Herbst, Jacob Bor

PMC · DOI: 10.1186/s13690-026-01849-8 · 2026-02-07

## TL;DR

Researchers in South Africa linked HIV treatment and surveillance data to better understand care access and outcomes in a rural HIV-endemic area.

## Contribution

A graph-based probabilistic record linkage algorithm was implemented to deduplicate and unify four HIV-related data sources in rural South Africa, achieving 92.7% sensitivity and 96.5% positive predictive value.

## Key findings

- 986,832 records were successfully linked with 92.7% sensitivity and 96.5% positive predictive value.
- Among adults living with HIV who had ever sought public-sector HIV care, 66.5% were currently on ART; 89.2% of those currently on ART were virally suppressed.
- Persistent gaps in retention in care and viral suppression were identified in the HIV-endemic region.

## Abstract

Integrating HIV clinical records with population-based surveillance data allows the study of health care seeking behaviours, access to care, and predictors of patient outcomes. We implemented a graph-based record linkage algorithm to deduplicate and link HIV clinical and population-based surveillance records in an HIV-endemic setting in rural South Africa.
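To illustrate what "graph-based" deduplication and linkage can mean in practice, the sketch below treats each record as a node and each accepted match as an edge, then reads off connected components as individuals. This is a minimal illustration under stated assumptions, not the authors' implementation: the record identifiers, the `match_pairs` input, and the function name `connected_components` are all hypothetical, and how pairs are scored and accepted is left out entirely.

```python
from collections import defaultdict


def connected_components(record_ids, match_pairs):
    """Group records into clusters, one cluster per connected component.

    record_ids: iterable of hashable record identifiers (hypothetical format).
    match_pairs: iterable of (a, b) pairs already judged to be the same person.
    """
    parent = {r: r for r in record_ids}

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in match_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    clusters = defaultdict(set)
    for r in record_ids:
        clusters[find(r)].add(r)
    # Sort for a deterministic result.
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical records from the four sources; the IDs are made up.
records = ["hdss:1", "tier:7", "nhls:3", "nhls:4", "ahrilink:9"]
pairs = [("hdss:1", "tier:7"), ("tier:7", "nhls:3"), ("nhls:4", "ahrilink:9")]
clusters = connected_components(records, pairs)
```

Because linkage is transitive in this representation, `hdss:1` and `nhls:3` end up in the same cluster even though they were never compared directly. The same property means a single false edge can merge two people, which is one reason linkage accuracy needs to be validated.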

We linked four data sources to create the Africa Health Research Institute (AHRI) Unified Data Platform: AHRI’s Health and Demographic Surveillance System (HDSS), AHRI Clinic and Hospital Information System (AHRILink), National Health Laboratory Service (NHLS), and Three Integrated Electronic Registers (TIER.Net) HIV care and treatment records. HDSS data were collected between January 1, 2000, and July 31, 2024, through repeated household surveys of over 140,000 individuals. Clinical and laboratory data were obtained for one hospital and 17 clinics in Hlabisa, KwaZulu-Natal, covering the HDSS surveillance area. We implemented a probabilistic record linkage algorithm trained and validated on a subset of records with national identity numbers. We assessed linkage accuracy, computed descriptive statistics for the linked database, and estimated the HIV care cascade for this population.
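Validation against records with national identity numbers can be pictured as a pairwise comparison: two records form a true link if they share an identity number and a predicted link if the algorithm places them in the same cluster. The sketch below computes sensitivity and positive predictive value on that basis. It is an assumption about how such an assessment could be set up, not a description of the procedure used in the paper; the function name `pairwise_accuracy` and the toy data are hypothetical.

```python
from itertools import combinations


def pairwise_accuracy(predicted_cluster, national_id):
    """Pairwise sensitivity and PPV of predicted links.

    predicted_cluster: dict mapping record id -> predicted cluster label.
    national_id: dict mapping record id -> national identity number,
        containing only records for which the number is known.
    """
    tp = fp = fn = 0
    for a, b in combinations(sorted(national_id), 2):
        same_pred = predicted_cluster[a] == predicted_cluster[b]
        same_true = national_id[a] == national_id[b]
        if same_pred and same_true:
            tp += 1  # correctly linked pair
        elif same_pred:
            fp += 1  # linked, but different people
        elif same_true:
            fn += 1  # same person, link missed
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, ppv


# Toy data (hypothetical): three records for person A, two for person B.
truth = {"r1": "A", "r2": "A", "r3": "A", "r4": "B", "r5": "B"}
predicted = {"r1": "c1", "r2": "c1", "r3": "c2", "r4": "c2", "r5": "c3"}
sens, ppv = pairwise_accuracy(predicted, truth)
```

In the toy data there are four true pairs and two predicted pairs, of which one is correct, so sensitivity is 0.25 and PPV is 0.5.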

A total of 986,832 records were successfully linked across the four databases, achieving a sensitivity of 92.7% and a positive predictive value of 96.5% (F-score = 0.95). The mean (standard deviation, SD) number of records in TIER.Net, HDSS, AHRILink, and NHLS was 1.18 (0.44), 1.05 (0.23), 1.13 (0.40), and 5.21 (4.24), respectively. The linked data indicated that 12,293 HDSS resident adults (≥15 years) were living with HIV at some point during the 2022 and 2024 surveillance rounds. Of these, 10,622 (86.4%) had ever sought HIV care in the public sector. Among those who had sought care, 10,492 (98.8%) had ever started ART and 7,065 (66.5%) were currently on ART. Of those currently on ART, 6,301 (89.2%) were virally suppressed (viral load <200 copies/mL).
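The reported F-score and cascade percentages can be reproduced from the counts above, which also makes the denominator of each step explicit. The sketch below uses only figures stated in the abstract; the variable names are illustrative, and the final quantity (currently on ART as a share of all adults living with HIV) is derived here and is not reported in the abstract.

```python
# Linkage accuracy as reported in the abstract.
sensitivity = 0.927
ppv = 0.965
# F-score as the harmonic mean of sensitivity and PPV.
f_score = 2 * sensitivity * ppv / (sensitivity + ppv)

# Cascade counts as reported in the abstract.
living_with_hiv = 12_293
ever_sought_care = 10_622
ever_started_art = 10_492
currently_on_art = 7_065
virally_suppressed = 6_301


def pct(numerator, denominator):
    """Percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


cascade = {
    "sought care / living with HIV": pct(ever_sought_care, living_with_hiv),
    "started ART / sought care": pct(ever_started_art, ever_sought_care),
    "on ART / sought care": pct(currently_on_art, ever_sought_care),
    "suppressed / on ART": pct(virally_suppressed, currently_on_art),
    # Derived here, not reported in the abstract:
    "on ART / living with HIV": pct(currently_on_art, living_with_hiv),
}

print(round(f_score, 2))
for step, value in cascade.items():
    print(f"{step}: {value}%")
```

Note that the 66.5% figure is conditional on having ever sought care; measured against all 12,293 adults living with HIV, the share currently on ART is lower.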

HIV care and population surveillance records from four data sources were deduplicated and linked with high accuracy, revealing persistent gaps in retention in care and viral suppression in an HIV-endemic region in rural South Africa. The AHRI Unified Data Platform offers the potential to deepen our understanding of HIV epidemiology in a well-described population and to improve services for HIV.

The online version contains supplementary material available at 10.1186/s13690-026-01849-8.

## Linked entities

- **Species:** Homo sapiens (taxon 9606)

## Full-text entities

- **Diseases:** HIV (MESH:D015658)
- **Species:** Human immunodeficiency virus 1 (no rank) [taxon 11676], Homo sapiens (human, species) [taxon 9606]

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12977758/full.md

---
Source: https://tomesphere.com/paper/PMC12977758