# An end-to-end framework for data lineage analysis covering link pattern recognition, fault diagnosis, and early warning

**Authors:** Rongxu Hou, Shaobo Zhang, Hongjiang Wang, Siwei Li, Yiying Zhang

PMC · DOI: 10.1038/s41598-025-34522-1 · Scientific Reports · 2026-01-07

## TL;DR

This paper introduces a framework for analyzing data lineage to detect and predict data link failures in complex platforms using graph structures and deep learning.

## Contribution

The novel contribution is an end-to-end framework combining graph neural networks and temporal convolutional networks for link pattern recognition, fault diagnosis, and adaptive warning.
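To make the graph side of this combination concrete, here is a minimal sketch of one round of neighbour aggregation over a lineage graph, the basic operation that GNN layers build on. It is not the authors' model: the table names, the two-element feature vectors, and the `mean_aggregate` helper are all hypothetical, and a real GNN layer would add learned weights and a nonlinearity.

```python
from collections import defaultdict

def mean_aggregate(edges, features):
    """One round of mean aggregation over a directed lineage graph.

    edges: list of (upstream, downstream) node names (hypothetical).
    features: dict mapping node name -> list of floats.
    Returns a dict mapping each node to the average of its own vector
    and the vectors of its direct upstream parents.
    """
    parents = defaultdict(list)
    for src, dst in edges:
        parents[dst].append(src)

    out = {}
    for node, vec in features.items():
        group = [vec] + [features[p] for p in parents[node]]
        # Average column-wise across the node and its parents.
        out[node] = [sum(col) / len(group) for col in zip(*group)]
    return out

# Hypothetical lineage: two raw tables feed a cleaned table, which feeds a report.
edges = [
    ("raw_orders", "clean_orders"),
    ("raw_users", "clean_orders"),
    ("clean_orders", "daily_report"),
]
# Hypothetical per-node metrics, e.g. [latency, error count].
features = {
    "raw_orders": [1.0, 0.0],
    "raw_users": [3.0, 0.0],
    "clean_orders": [2.0, 3.0],
    "daily_report": [4.0, 1.0],
}
embedded = mean_aggregate(edges, features)
```

After one round, each node's vector reflects its immediate upstream state. Stacking further rounds would propagate information from more distant ancestors in the lineage graph.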

## Key findings

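- EEFL, the proposed End-to-End Full-Link framework, achieves an average fault-classification accuracy of 92.73% across two datasets (real enterprise data and simulated data).
- The framework outperforms traditional methods at fault detection and reduces the false alarm rate.
- The dynamic threshold warning mechanism adapts its alarm-trigger conditions using Bayesian optimization and online learning.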
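The idea of an adaptive alarm threshold can be illustrated with a small sketch. It assumes a much simpler rule than the paper describes: the threshold is a running mean plus `k` running standard deviations, updated one sample at a time with exponential weighting. The class name, the parameters `k`, `alpha`, and `warmup`, and the latency values are all hypothetical. The Bayesian optimization step is not shown; `k` is simply fixed here, where the paper's mechanism would tune such parameters.

```python
import math

class OnlineThreshold:
    """Alarm threshold = running mean + k * running std, updated per sample."""

    def __init__(self, k=3.0, alpha=0.1, warmup=5):
        self.k = k            # width of the tolerance band (fixed in this sketch)
        self.alpha = alpha    # weight given to the newest sample
        self.warmup = warmup  # samples to observe before alarms may fire
        self.mean = 0.0
        self.var = 0.0
        self.n = 0

    def update(self, x):
        """Check x against the current threshold, then learn from x."""
        threshold = self.mean + self.k * math.sqrt(self.var)
        alarm = self.n >= self.warmup and x > threshold
        if self.n == 0:
            self.mean = x
        else:
            # Exponentially weighted updates of mean and variance.
            delta = x - self.mean
            self.mean += self.alpha * delta
            self.var = (1 - self.alpha) * (self.var + self.alpha * delta * delta)
        self.n += 1
        return alarm

# Hypothetical link latencies in milliseconds, ending with a spike.
detector = OnlineThreshold()
alarms = [detector.update(x) for x in [100, 102, 98, 101, 99, 100, 500]]
```

In this run only the final spike raises an alarm. One design caveat of this simplified rule is that anomalous samples are also absorbed into the running statistics, which widens the band after each incident.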

## Abstract

As data platforms grow more complex, real-time prediction and tracing of data link failures has become a critical problem. We propose an End-to-End Full-Link intelligent analysis framework (EEFL) based on data lineage. The framework combines graph structures with deep learning algorithms to achieve link pattern recognition and fault warning. First, a dynamic data lineage graph model is constructed and topological features are extracted using a graph neural network (GNN). Through temporal edge weight optimization and semi-supervised clustering, typical link patterns are automatically classified. Second, a hybrid fault diagnosis model is designed, using a temporal convolutional network (TCN) to capture long-term dependencies between link metrics and combining it with a GNN to analyze topological mutations. This model accurately classifies various fault types, including data outages, latency anomalies, and data contamination. Finally, a dynamic threshold warning mechanism is introduced, combining Bayesian optimization and online learning to adaptively adjust alarm triggering conditions and effectively reduce false alarm rates. We verify the generalization ability of the model using real enterprise data and simulation data. Experimental results show that EEFL achieves an average accuracy of 92.73% across the two datasets, significantly better than traditional methods, and provides intelligent decision support for data governance.
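The abstract credits the TCN with capturing long-term dependencies between link metrics. The sketch below shows the mechanism TCNs generally rely on for this, a causal dilated convolution, in plain Python. It is illustrative only and not the paper's architecture: the function names, kernel weights, and latency series are hypothetical, and a real TCN would learn its kernels and add residual connections.

```python
def causal_dilated_conv(series, kernel, dilation):
    """y[t] = sum_i kernel[i] * series[t - i * dilation], zero-padded on the left.

    "Causal" means the output at time t never uses samples later than t.
    """
    out = []
    for t in range(len(series)):
        acc = 0.0
        for i, w in enumerate(kernel):
            j = t - i * dilation
            if j >= 0:  # positions before the series start count as zero
                acc += w * series[j]
        out.append(acc)
    return out

def receptive_field(kernel_size, dilations):
    """Past time steps visible to a stack with one convolution per dilation."""
    return 1 + (kernel_size - 1) * sum(dilations)

# Hypothetical latency series; each output mixes the sample at t with the one at t - 2.
latency = [1.0, 2.0, 3.0, 4.0, 5.0]
smoothed = causal_dilated_conv(latency, kernel=[0.5, 0.5], dilation=2)

# Doubling the dilation per layer grows the visible history exponentially.
history = receptive_field(kernel_size=2, dilations=[1, 2, 4, 8])
```

With a kernel of size 2 and dilations 1, 2, 4, and 8, four layers already see 16 time steps. This is why a stack of dilated convolutions can cover a long history with few layers.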

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12864967/full.md

## Figures

10 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12864967/full.md

---
Source: https://tomesphere.com/paper/PMC12864967