# Automated Execution of Data Pipelines based on Configuration Files

**Authors:** Károly Bósa, Paul Heinzlreiter, Uta Störl, Valerie Restat, Dr. Virendra Kumar Tiwari

PMC · DOI: 10.12688/openreseurope.21019.1 · Open Research Europe · 2025-09-19

## TL;DR

This paper introduces a flexible framework for automating data preparation tasks using configuration files, reducing repetitive work and improving code reuse.

## Contribution

A novel configuration-driven framework for modular and reusable data pipeline execution with support for iterative constructs.

## Key findings

- The framework improves code reuse and maintainability by using a declarative configuration approach.
- Cyclomatic complexity is used as the metric to assess how the approach affects development effort in relatively simple, real-world data engineering scenarios.
- The tool supports loops and limited recursion for handling complex workflows.

## Abstract

Data preparation is a fundamental aspect of data engineering, a prerequisite for later tasks such as data visualization, reporting, and training machine learning models. Despite the recurring patterns in data transformation processes, the specific steps often vary depending on the project context, data sources, and application domain.

To address these challenges, this paper presents a flexible and extensible framework that enables the coordinated execution of modular data processing steps defined in a configuration file. By adopting a declarative, configuration-driven approach, the framework promotes modular, step-by-step development while substantially improving code reuse, maintainability, and adaptability. The framework also supports basic iterative execution constructs, such as loops and limited recursion, within the data pipeline definitions to accommodate more complex workflows.
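The configuration-driven idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the registry `STEP_REGISTRY`, the `step` decorator, the `run_steps` function, the `"$var"` placeholder syntax, and the configuration schema are all hypothetical names chosen for illustration. A JSON string stands in for the configuration file only so that the sketch needs nothing beyond the standard library; the framework's actual file format and schema are not shown in this summary.

```python
import json
from typing import Any, Callable, Dict, List

# Hypothetical registry mapping step names (as used in the config) to
# reusable functions. This is an assumption, not the paper's API.
STEP_REGISTRY: Dict[str, Callable[..., Any]] = {}


def step(name: str) -> Callable:
    """Register a function as a reusable pipeline step under `name`."""
    def decorator(func: Callable) -> Callable:
        STEP_REGISTRY[name] = func
        return func
    return decorator


@step("drop_missing")
def drop_missing(rows: List[dict], column: str) -> List[dict]:
    # Keep only the rows that have a value in `column`.
    return [r for r in rows if r.get(column) is not None]


@step("scale")
def scale(rows: List[dict], column: str, factor: float) -> List[dict]:
    # Multiply `column` by `factor` in every row.
    return [{**r, column: r[column] * factor} for r in rows]


def run_steps(data: Any, steps: List[dict]) -> Any:
    """Execute the listed steps in order, expanding simple loops."""
    for spec in steps:
        if "loop" in spec:
            loop = spec["loop"]
            placeholder = "$" + loop["var"]
            for value in loop["values"]:
                # Substitute the loop variable into each body step's params.
                # (Nested loops are run, but this sketch does not propagate
                # the outer variable into them.)
                body = [
                    {**s, "params": {k: (value if v == placeholder else v)
                                     for k, v in s.get("params", {}).items()}}
                    for s in loop["steps"]
                ]
                data = run_steps(data, body)
        else:
            func = STEP_REGISTRY[spec["step"]]
            data = func(data, **spec.get("params", {}))
    return data


# Hypothetical configuration: one loop over two columns, then one plain step.
CONFIG = json.loads("""
{
  "pipeline": [
    {"loop": {"var": "col", "values": ["a", "b"],
              "steps": [{"step": "drop_missing", "params": {"column": "$col"}}]}},
    {"step": "scale", "params": {"column": "a", "factor": 10}}
  ]
}
""")

rows = [{"a": 1, "b": 2}, {"a": None, "b": 3}, {"a": 4, "b": None}]
result = run_steps(rows, CONFIG["pipeline"])
print(result)
```

In this sketch, changing the pipeline means editing the configuration rather than the code, and each registered function can be reused across projects, which is the kind of reuse the abstract describes.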

By enabling the reuse of existing code snippets, the framework shifts development efforts toward enhancing and refining a shared code base, rather than repeatedly creating project-specific, disposable implementations. The long-term benefits of this approach become increasingly apparent as the system evolves. As more generalized modules and functions are developed, they can reduce duplication and improve maintainability without sacrificing flexibility.

To assess the effectiveness of the framework, we apply cyclomatic complexity as a metric, demonstrating how the proposed approach impacts the development effort across some relatively simple, real-world data engineering scenarios.
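For readers unfamiliar with the metric, cyclomatic complexity is commonly approximated as the number of decision points in a piece of code plus one. The sketch below computes that approximation for Python source using the standard `ast` module. It is an illustration under stated assumptions: the function name `cyclomatic_complexity` and the two sample snippets are hypothetical, the counting rules are a simplification, and nothing here reproduces the paper's measurements or its measuring tool.

```python
import ast

# Node types treated as one decision point each in this simplified count.
_BRANCH_NODES = (ast.If, ast.For, ast.AsyncFor, ast.While,
                 ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity: decision points + 1."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a or b or c` adds two extra paths (one per extra operand).
            complexity += len(node.values) - 1
        elif isinstance(node, ast.comprehension):
            # One for the implicit loop, one per `if` filter.
            complexity += 1 + len(node.ifs)
    return complexity


# Illustrative project-specific code with explicit control flow.
HAND_WRITTEN = '''
def prepare(rows):
    out = []
    for r in rows:
        if r["a"] is None or r["b"] is None:
            continue
        out.append(r)
    return out
'''

# Illustrative glue code when the control flow lives in a configuration file.
DECLARATIVE_GLUE = 'result = run_steps(rows, config["pipeline"])'

print(cyclomatic_complexity(HAND_WRITTEN))      # 4: base 1 + for + if + or
print(cyclomatic_complexity(DECLARATIVE_GLUE))  # 1: no decision points
```

The two snippets are toy inputs meant only to show how the count behaves; they say nothing about the magnitude of any effect reported in the paper.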

This paper presents a flexible software tool that helps organize and automate data preparation steps, for example for reports, visualizations, or machine learning. Instead of writing separate program code for each project, users link data processing steps together as building blocks by listing them in a configuration file. This makes the process easier to reuse, update, and adapt to different needs, saving time and reducing repetitive work. The tool also encourages the creation of a shared library of reusable processing components.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12784049/full.md

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12784049/full.md

## References

25 references — full list in the complete paper: https://tomesphere.com/paper/PMC12784049/full.md

---
Source: https://tomesphere.com/paper/PMC12784049