Parsing GTF and FASTA files using the eccLib Library
Tomasz Chady, Zuzanna Karolina Filutowska

TL;DR
eccLib is a fast Python library, implemented in C, for parsing genomic files such as GTF/GFFv3 and FASTA and for analysing genomic contexts.
Contribution
The contribution is eccLib, a high-performance library written in C on top of the Python/C API for parsing genomic files from Python.
Findings
To the best of the authors' knowledge, eccLib is the fastest Python-based solution for parsing genomic files.
The library is implemented in C, enabling optimizations that are unachievable or impractical in Python.
It supports parsing GTF/GFFv3 and FASTA files with additional analysis methods.
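For readers unfamiliar with the format, the following is a minimal pure-Python sketch of what parsing a single GTF line involves. It is not eccLib's API: the names `GtfRecord` and `parse_gtf_line` are hypothetical, and the attribute handling assumes the common `key "value";` layout with no semicolons inside quoted values.

```python
from typing import Dict, NamedTuple, Optional


class GtfRecord(NamedTuple):
    """Hypothetical container for the nine tab-separated GTF columns."""
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]
    strand: str
    frame: Optional[int]
    attributes: Dict[str, str]


def parse_gtf_line(line: str) -> Optional[GtfRecord]:
    """Parse one GTF line; return None for blank lines and comments."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")

    # Simplification: split attributes on ';' and take the first space
    # as the key/value separator.
    attributes: Dict[str, str] = {}
    for item in fields[8].split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')

    return GtfRecord(
        seqname=fields[0],
        source=fields[1],
        feature=fields[2],
        start=int(fields[3]),
        end=int(fields[4]),
        score=None if fields[5] == "." else float(fields[5]),  # '.' means missing
        strand=fields[6],
        frame=None if fields[7] == "." else int(fields[7]),
        attributes=attributes,
    )
```

A pure-Python loop of this kind does string splitting and object creation for every line, which is the sort of per-record overhead a C implementation can reduce.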
Abstract
Leveraging the Python/C API, eccLib was developed as a high-performance library designed for parsing genomic files and analysing genomic contexts. To the best of the authors’ knowledge, it is the fastest Python-based solution available. With eccLib, users can efficiently parse GTF/GFFv3 and FASTA files and utilize the provided methods for additional analysis. This library is implemented in C and distributed under the GPL-3.0 licence. It is compatible with any system that has the Python interpreter (CPython) installed. The use of C enables numerous optimizations at both the implementation and algorithmic levels, which are either unachievable or impractical in Python.
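As an illustration of the FASTA side of the task, here is a minimal pure-Python reader. It is a sketch of the file format only and does not reflect how eccLib is implemented or called: the function name `read_fasta` is hypothetical, and the code assumes each record starts with a `>` header line followed by one or more sequence lines.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

Because the function accepts any iterable of strings, it can be passed an open file object directly, for example `with open(path) as fh: records = list(read_fasta(fh))`, where `path` is a placeholder for a local FASTA file.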
Taxonomy
Topics: Mathematics, Computing, and Information Processing
