# Integrating text mining and knowledge graph to enhance biopharmaceutical process optimization

**Authors:** Shovan Bhowmik, Manju Anandakrishnan, Leah Klein, Cecilia Arighi, Marisa Gioioso, Cathy Wu, Austin Brockmeier, K. Vijay-Shanker, Chuming Chen

PMC · DOI: 10.1371/journal.pone.0339197 · PLOS One · 2026-01-14

## TL;DR

This paper introduces a framework using text mining and knowledge graphs to uncover relationships between cell culture conditions and glycosylation in biopharmaceutical manufacturing.

## Contribution

A novel framework combining text mining and knowledge graphs to extract and visualize bioprocess relationships from scientific literature.

## Key findings

- The framework achieves an 88% F1-score in extracting relationships between process parameters and glycan attributes.
- The knowledge graph reveals both direct and indirect bioprocess interactions from fragmented scientific literature.
- The system provides an intuitive web interface for dynamic exploration of bioprocess data.
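
For context on the first finding, the F1-score is conventionally defined as the harmonic mean of precision $P$ and recall $R$. This summary reports only the combined score, so the individual values below are symbols, not numbers taken from the paper:

$$
F_1 = \frac{2PR}{P + R}
$$

Because $F_1$ increases with each of $P$ and $R$, and neither can exceed 1, a given $F_1$ bounds both from below: $P \ge F_1 / (2 - F_1)$, and likewise for $R$. With $F_1 \approx 0.88$, this gives $0.88 / 1.12 \approx 0.79$, so under the standard definition both precision and recall would be at least roughly 79%.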

## Abstract

To guarantee consistent quality of therapeutic proteins, the relationship between manufacturing process parameters and glycosylation profiles must be investigated and understood. The most important manufacturing step to investigate is the cell culture unit operation, where glycoprotein structure is highly dependent on raw materials, cell line genetics, and process control ranges. Because of the critical role glycosylation plays in certain drug mechanisms of action, the relationships between specific process inputs and glycosylation have been documented extensively. However, despite the extensive body of published work, general relationships between different cell culture conditions and glycosylation profiles remain fragmented across diverse studies, hindering systematic analysis and data-driven decision-making. To better elucidate these general relationships from published research, we introduce an innovative framework that leverages text mining and knowledge graph technologies to automatically extract, integrate, and visualize complex relationships from scientific literature, enabling actionable insights for biopharmaceutical process (bioprocess) development. Our methodology centers on the design and development of a specialized text-mining pipeline to extract and quantify relationships between cell culture conditions (raw materials, cell line genetics, and process control ranges) and glycosylation profiles from unstructured scientific literature. To enhance precision, we implement a dual normalization strategy: 1) dictionary-based concept standardization to reconcile term variants, and 2) ontological classification to organize entities into hierarchically structured categories. These curated relationships are then systematically integrated into a knowledge graph, which not only captures direct parameter-outcome associations but also reveals higher-order indirect connections through the graph, providing a comprehensive view of bioprocess interactions.

We present an intuitive web-based interface that enables researchers to dynamically explore and visualize complex bioprocess relationships through interactive queries. The system demonstrates robust performance with an 88% F1-score in relation extraction, effectively revealing hidden relationships between process parameters and glycan attributes. By combining scalable knowledge graph technology with interpretable analytics, our solution empowers pharmaceutical researchers to optimize therapeutic glycan profiles and accelerate manufacturing process development. This advancement represents a significant step forward in data-driven bioprocess optimization.
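
The dual normalization strategy and the idea of indirect connections in the knowledge graph can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The synonym dictionary, the category table, the function names (`normalize`, `build_graph`, `find_path`), and the three example relations are all hypothetical toy data chosen for illustration, and the graph is a plain adjacency map rather than whatever graph store the paper uses.

```python
from collections import defaultdict, deque

# Hypothetical synonym dictionary (assumption): surface form -> canonical concept.
SYNONYMS = {
    "mncl2": "manganese",
    "mn2+": "manganese",
    "b4galt1": "galactosyltransferase",
    "galactosylated glycans": "galactosylation",
}

# Hypothetical ontology (assumption): canonical concept -> category.
CATEGORIES = {
    "manganese": "raw material",
    "galactosyltransferase": "cell line genetics",
    "galactosylation": "glycan attribute",
}


def normalize(term):
    """Return (canonical concept, category) for a raw text mention."""
    key = term.strip().lower()
    concept = SYNONYMS.get(key, key)  # step 1: dictionary-based standardization
    category = CATEGORIES.get(concept, "unclassified")  # step 2: classification
    return concept, category


def build_graph(relations):
    """Build a directed adjacency map from (source, target) mention pairs."""
    graph = defaultdict(set)
    for source, target in relations:
        graph[normalize(source)[0]].add(normalize(target)[0])
    return graph


def find_path(graph, start, goal):
    """Breadth-first search for the shortest directed path, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in sorted(graph.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Toy "extracted" relations (assumption, not taken from the paper).
extracted = [
    ("MnCl2", "B4GALT1"),
    ("Mn2+", "galactosyltransferase"),
    ("galactosyltransferase", "galactosylated glycans"),
]
graph = build_graph(extracted)
path = find_path(graph, "manganese", "galactosylation")
print(path)
```

In this toy example no single extracted relation links manganese to galactosylation, yet after normalization merges the term variants, a two-hop path between them appears. That is the kind of indirect connection the abstract describes, shown here only in miniature.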

## Linked entities

- **Proteins:** glycoprotein (Gn/Gc glycoprotein)

## Full-text entities

- **Chemicals:** glycan (MESH:D011134)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12803468/full.md

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12803468/full.md

## References

51 references — full list in the complete paper: https://tomesphere.com/paper/PMC12803468/full.md

---
Source: https://tomesphere.com/paper/PMC12803468