# MilkOligoCorpus: A semantically annotated resource for knowledge extraction on mammalian milk oligosaccharides

**Authors:** Mathilde Rumeau, Marine Courtin, Robert Bossy, Clara Sauvion, Valentin Loux, Mouhamadou Ba, Christelle Knudsen, Sylvie Combes, Claire Nédellec, Louise Deléger

PMC · DOI: 10.1371/journal.pone.0319729 · PLOS One · 2025-08-04

## TL;DR

This paper introduces MilkOligoCorpus, a manually annotated gold-standard corpus for training and evaluating natural language processing models that extract information on milk oligosaccharide composition across mammalian species.

## Contribution

MilkOligoCorpus is the first annotated corpus specifically designed for extracting information on milk oligosaccharides.

## Key findings

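- MilkOligoCorpus includes 15 abstracts and 15 extracts drawn from 20 full-text articles indexed by PubMed, annotated with entities (individuals, samples, oligosaccharides, and oligosaccharide quantification) linked by binary and n-ary relationships.
- Four terminological resources, supported by external ontologies, were developed to assign unique identifiers to entities and to address data interoperability across publications and databases.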
- Baseline information extraction models were tested on the corpus to evaluate its utility.

## Abstract

Milk oligosaccharides are bioactive components that regulate the composition of the neonatal microbiota and exert immunomodulatory functions. Their beneficial effects depend on their structure. Numerous studies have shown intra- and inter-species variation in the structural composition and concentration of these compounds in mammalian milk, yet the biological significance of such variation remains poorly understood. Automated natural language processing methods are promising tools for extracting and gathering structured data from unstructured texts to get insight into the biological significance of milk oligosaccharide variation across mammals. These methods require training and evaluation on manually annotated text corpora. While annotated corpora exist for chemical substances, none are specifically designed for training natural language processing models to extract information on milk oligosaccharides. To this end, we propose MilkOligoCorpus, a new gold standard for milk oligosaccharide composition in mammalian species. MilkOligoCorpus’ annotation scheme is a rich entity/relation model designed to describe the diversity pattern of milk oligosaccharides according to female factor variability and to help better understand the structure-related function of milk oligosaccharides. MilkOligoCorpus consists of abstracts (15) and extracts (15) from 20 full text articles indexed by PubMed annotated with entities related to individuals, samples, oligosaccharides and oligosaccharide quantification linked by binary and n-ary relationships. To address data interoperability across disparate publications and databases, four terminological resources were also developed to assign unique identifiers to the entities, supported by external ontologies. This paper presents the creation of the MilkOligoCorpus and its associated schema, along with the development of annotation guidelines and terminological resources. We also present experimental results obtained by baseline information extraction models on the corpus.
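
To make the entity/relation model described in the abstract more concrete, here is a minimal Python sketch of how normalized entities and binary or n-ary relations might be represented. It is an illustration under stated assumptions, not the corpus's actual schema or file format: the class names, the entity labels (`Oligosaccharide`, `Individual`, `Sample`), the relation names (`sample_of`, `detected_in`), the role names, and the example sentence are all hypothetical. Only the two identifiers are taken from the entity list on this page.

```python
from dataclasses import dataclass
from typing import Optional

# Invented example sentence for illustration; it is not taken from the corpus.
TEXT = "2'-FL was quantified in human milk."


@dataclass
class Entity:
    id: str
    type: str                          # hypothetical label, e.g. "Oligosaccharide"
    text: str                          # surface form as it appears in TEXT
    start: int                         # character offsets into TEXT
    end: int
    concept_id: Optional[str] = None   # identifier from a terminological resource


@dataclass
class Relation:
    type: str                          # hypothetical relation name
    args: dict                         # role name -> entity id

    @property
    def arity(self) -> int:
        # Two roles make a binary relation; three or more make it n-ary.
        return len(self.args)


def mention(eid, etype, surface, concept_id=None):
    """Build an Entity by locating the first occurrence of `surface` in TEXT."""
    start = TEXT.index(surface)
    return Entity(eid, etype, surface, start, start + len(surface), concept_id)


entities = {
    e.id: e
    for e in [
        mention("T1", "Oligosaccharide", "2'-FL", "MESH:C031420"),
        mention("T2", "Individual", "human", "taxon:9606"),
        mention("T3", "Sample", "milk"),
    ]
}

relations = [
    # Binary relation: which individual a sample comes from.
    Relation("sample_of", {"sample": "T3", "individual": "T2"}),
    # N-ary relation: ties the compound, the sample, and the individual together.
    Relation(
        "detected_in",
        {"oligosaccharide": "T1", "sample": "T3", "individual": "T2"},
    ),
]


def resolve(relation):
    """Map each role of a relation to the surface text of its entity."""
    return {role: entities[eid].text for role, eid in relation.args.items()}


for r in relations:
    print(r.type, r.arity, resolve(r))
```

Keeping relation arguments as a role-to-entity mapping lets binary and n-ary relations share one structure, and attaching a `concept_id` to each mention is one simple way to reflect the abstract's goal of assigning unique identifiers for interoperability.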

## Full-text entities

- **Diseases:** OS (MESH:C567932)
- **Chemicals:** sugar (MESH:D000073893), 3'SL (-), oligosaccharide (MESH:D009844), Carbohydrates (MESH:D002241), 2'-FL (MESH:C031420), Glycan (MESH:D011134), monosaccharide (MESH:D009005), N-acetylneuraminic acid (MESH:D019158), OS (MESH:D009992)
- **Species:** Escherichia coli (E. coli, species) [taxon 562], Ovis aries (domestic sheep, species) [taxon 9940], Phoca vitulina vitulina (European harbour seal, subspecies) [taxon 51092], Sus scrofa (pig, species) [taxon 9823], Canis lupus familiaris (dog, subspecies) [taxon 9615], Oryctolagus cuniculus (domestic rabbit, species) [taxon 9986], Homo sapiens (human, species) [taxon 9606], Rattus norvegicus (brown rat, species) [taxon 10116], Saccharomyces cerevisiae (baker's yeast, species) [taxon 4932], Bos taurus (bovine, species) [taxon 9913], Ailuropoda melanoleuca (giant panda, species) [taxon 9646], Campylobacter jejuni (species) [taxon 197], Gallus gallus (bantam, species) [taxon 9031], Equus caballus (domestic horse, species) [taxon 9796], Mus musculus (house mouse, species) [taxon 10090]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12321079/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12321079/full.md

## References

43 references — full list in the complete paper: https://tomesphere.com/paper/PMC12321079/full.md

---
Source: https://tomesphere.com/paper/PMC12321079