Navigating in a sea of repeats in RNA-seq without drowning
Gustavo Sacomoto, Blerina Sinaimeri, Camille Marchet, Vincent Miele, Marie-France Sagot, Vincent Lacroix

TL;DR
This paper introduces a formal model for high copy number repeats in RNA-seq data, demonstrating the NP-completeness of identifying repeat-associated subgraphs, and proposes an algorithm to effectively assemble alternative splicing events outside repetitive regions.
Contribution
It provides a formal model for high copy number repeats in RNA-seq data, proves that identifying repeat-associated subgraphs in a de Bruijn graph is NP-complete, and offers an algorithm that improves the local assembly of alternative splicing events.
Findings
NP-completeness of identifying repeat subgraphs in de Bruijn graphs
Algorithm to identify alternative splicing events outside repeats
Validation results on synthetic and real data
Abstract
The main challenge in de novo assembly of NGS data is certainly to deal with repeats that are longer than the reads. This is particularly true for RNA-seq data, since coverage information cannot be used to flag repeated sequences, of which transposable elements are one of the main examples. Most transcriptome assemblers are based on de Bruijn graphs and have no clear and explicit model for repeats in RNA-seq data, relying instead on heuristics to deal with them. The results of this work are twofold. First, we introduce a formal model for representing high copy number repeats in RNA-seq data and exploit its properties for inferring a combinatorial characteristic of repeat-associated subgraphs. We show that the problem of identifying in a de Bruijn graph a subgraph with this characteristic is NP-complete. In a second step, we show that in the specific case of a local assembly of…
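To make the de Bruijn graph setting concrete, here is a minimal Python sketch of how a repeat shared by two transcripts collapses into shared nodes with branching at its borders. It is an illustration only, not the authors' model or algorithm: the function names `build_de_bruijn` and `in_degree`, the toy sequences, and the choice k = 3 are all assumptions, and reverse complements and sequencing errors are ignored.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each k-mer to the set of k-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        # Consecutive k-mers overlap by k-1 characters and are joined by an arc.
        for i in range(len(read) - k):
            src = read[i:i + k]
            dst = read[i + 1:i + k + 1]
            graph[src].add(dst)
            graph.setdefault(dst, set())  # make sure sink nodes are present too
    return dict(graph)


def in_degree(graph, node):
    """Count the nodes that have an arc pointing to `node`."""
    return sum(1 for succs in graph.values() if node in succs)


# Hypothetical toy transcripts sharing the repeat "CTCGA" (assumed data).
REPEAT = "CTCGA"
transcript_1 = "AAG" + REPEAT + "TTG"
transcript_2 = "GGT" + REPEAT + "CCA"

graph = build_de_bruijn([transcript_1, transcript_2], k=3)

# k-mers inside the repeat appear only once in the graph, so both
# transcripts are routed through the same nodes.
entry, exit_ = "CTC", "CGA"
print("predecessors of repeat entry:", in_degree(graph, entry))
print("successors of repeat exit:", sorted(graph[exit_]))
```

In this toy case the first k-mer of the repeat has two incoming arcs and the last one has two outgoing arcs, one per transcript. With many copies of a repeat, such branching nodes accumulate, which is the kind of graph region the abstract refers to as a repeat-associated subgraph.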
