# Community Resource: Large-Scale Proteogenomics to Refine Wheat Genome Annotations

**Authors:** Delphine Vincent, Rudi Appels

PMC · DOI: 10.3390/ijms25168614 · International Journal of Molecular Sciences · 2024-08-07

## TL;DR

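This study maps 861,759 unique peptides from public wheat proteomics datasets onto the IWGSC RefSeq v2.1 genome with tBLASTn. The mapping validates 31.4% of High Confidence gene models, flags Low Confidence models for possible promotion, and leaves orphan peptides that point to novel genes.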

## Contribution

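A reusable proteogenomics workflow (Galaxy workflow plus Python code) maps published peptide identifications to IWGSC RefSeq v2.1 with tBLASTn. The results are released as BED files through Apollo JBrowse and GitHub so the wheat community can refine gene annotations and discover novel genes.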

## Key findings

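- 83,015 unique peptides aligned to 33,612 High Confidence (HC) genes, validating 31.4% of all wheat HC gene models.
- 6685 unique peptides mapped to 3702 Low Confidence (LC) gene models, which the authors argue should be considered for HC status.
- The remaining 2934 orphan peptides can be used for novel gene discovery, as exemplified on chromosome 4D.
- tBLASTn could not map peptides exhibiting a mid-sequence frameshift.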
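
To put these counts in proportion, the short Python sketch below expresses each peptide category as a share of the 861,759 input peptides and back-calculates the HC gene total implied by the rounded 31.4% figure. All counts come from the abstract; the variable and function names are illustrative, and the implied total is only a back-of-envelope estimate, not a number reported in this summary.

```python
# Counts as reported in the abstract.
TOTAL_PEPTIDES = 861_759
MAPPED = {"HC": 83_015, "LC": 6_685, "orphan": 2_934}
HC_GENES_VALIDATED = 33_612
HC_FRACTION_VALIDATED = 0.314  # reported as 31.4%, so already rounded


def share(count: int, total: int) -> float:
    """Return count as a percentage of total."""
    return 100.0 * count / total


for label, n in MAPPED.items():
    pct = share(n, TOTAL_PEPTIDES)
    print(f"{label:>6}: {n:>7,} peptides ({pct:.2f}% of input)")

# Rough estimate only: the percentage is rounded to one decimal place,
# so the implied total carries an uncertainty of a few hundred genes.
implied_hc_total = HC_GENES_VALIDATED / HC_FRACTION_VALIDATED
print(f"Implied number of HC gene models: about {implied_hc_total:,.0f}")
```

Only a small fraction of the input peptides ends up in any of the three categories, which is worth keeping in mind when reading the validation percentage.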

## Abstract

Triticum aestivum is an important crop whose reference genome (International Wheat Genome Sequencing Consortium (IWGSC) RefSeq v2.1) offers a valuable resource for understanding wheat genetic structure, improving agronomic traits, and developing new cultivars. A key aspect of gene model annotation is protein-level evidence of gene expression obtained from proteomics studies, followed up by proteogenomics to physically map proteins to the genome. In this research, we have retrieved the largest recent wheat proteomics datasets publicly available and applied the Basic Local Alignment Search Tool (tBLASTn) algorithm to map the 861,759 identified unique peptides against IWGSC RefSeq v2.1. Of the 92,719 hits, 83,015 unique peptides aligned along 33,612 High Confidence (HC) genes, thus validating 31.4% of all wheat HC gene models. Furthermore, 6685 unique peptides were mapped against 3702 Low Confidence (LC) gene models, and we argue that these gene models should be considered for HC status. The remaining 2934 orphan peptides can be used for novel gene discovery, as exemplified here on chromosome 4D. We demonstrated that tBLASTn could not map peptides exhibiting a mid-sequence frameshift. We supply all our proteogenomics results, Galaxy workflow and Python code, as well as Browser Extensible Data (BED) files, as a resource for the wheat community via the Apollo JBrowse and GitHub repositories. Our workflow could be applied to other proteomics datasets to expand this resource with proteins and peptides from biotically and abiotically stressed samples. This would help tease out wheat gene expression under various environmental conditions, both spatially and temporally.
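
The mapping step described above (tBLASTn hits turned into BED features, then sorted into HC, LC, or orphan) might be sketched in Python as follows. This is not the authors' released code. It assumes hits are exported in BLAST tabular format (`-outfmt 6`), where columns 9 and 10 hold 1-based subject coordinates and a start greater than the end marks a minus-strand hit. The `BedRecord` type, the `classify` helper, the `Chr4D` coordinates, and the strand-agnostic overlap rule are all hypothetical choices made for illustration.

```python
from typing import NamedTuple, Sequence, Tuple


class BedRecord(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive (BED convention)
    end: int     # 0-based, exclusive (BED convention)
    name: str
    score: int   # BED scores range from 0 to 1000
    strand: str


# Hypothetical gene model tuple: (chrom, start, end, confidence),
# with BED-style coordinates and confidence "HC" or "LC".
GeneModel = Tuple[str, int, int, str]


def blast6_to_bed(line: str) -> BedRecord:
    """Convert one BLAST tabular (outfmt 6) hit line into a BED record."""
    fields = line.rstrip("\n").split("\t")
    peptide_id, chrom = fields[0], fields[1]
    pident = float(fields[2])
    sstart, send = int(fields[8]), int(fields[9])
    # For translated searches, reversed subject coordinates mean minus strand.
    strand = "+" if sstart <= send else "-"
    start = min(sstart, send) - 1   # 1-based closed -> 0-based half-open
    end = max(sstart, send)
    score = min(1000, round(pident * 10))  # scale percent identity to 0..1000
    return BedRecord(chrom, start, end, peptide_id, score, strand)


def classify(hit: BedRecord, genes: Sequence[GeneModel]) -> str:
    """Label a hit HC, LC, or orphan by overlap with gene models.

    Assumption: overlap ignores strand, and HC takes priority over LC.
    """
    labels = {
        conf
        for chrom, gstart, gend, conf in genes
        if chrom == hit.chrom and hit.start < gend and gstart < hit.end
    }
    if "HC" in labels:
        return "HC"
    if "LC" in labels:
        return "LC"
    return "orphan"


# Made-up example: a 10-residue peptide hitting the minus strand.
example = "pep1\tChr4D\t100.0\t10\t0\t0\t1\t10\t1030\t1001\t1e-5\t30.0"
genes = [("Chr4D", 900, 1100, "LC"), ("Chr4D", 5000, 6000, "HC")]
hit = blast6_to_bed(example)
print("\t".join(str(value) for value in hit), classify(hit, genes))
```

A linear scan over gene models is fine for a toy example; at genome scale an interval index or a dedicated tool for BED intersection would be the more practical choice.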

## Linked entities

- **Species:** Triticum aestivum (taxon 4565)

## Full-text entities

- **Species:** Triticum aestivum (bread wheat, species) [taxon 4565]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC11354340/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC11354340/full.md

## References

29 references — full list in the complete paper: https://tomesphere.com/paper/PMC11354340/full.md

---
Source: https://tomesphere.com/paper/PMC11354340