Typesafe Coordinate Systems in High-Throughput Sequencing Applications
Charles Thomas Gregory, James S. Blachly

TL;DR
This paper proposes typesafe coordinates for high-throughput sequencing data: each interval carries its basis (zero or one) and end convention (open or closed) in its type, so that a statically typed language can rule out off-by-one errors when data moves between incompatible formats.
Contribution
It introduces a novel typesafe coordinate model for sequencing data and provides implementations across multiple programming languages to improve data integrity.
Findings
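Static guarantees from the type system eliminate an entire class of coordinate errors, namely the off-by-one errors that arise when coordinate conventions are mixed.
A reference implementation in D (part of dhtslib) and proofs of concept in Rust, OCaml, and Python demonstrate practicality.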
The approach enhances reliability of sequencing data processing.
Abstract
High-throughput sequencing file formats and tools encode coordinate intervals with respect to a reference sequence in at least four distinct, incompatible ways. Integrating data from and moving data between different formats has the potential to introduce subtle off-by-one errors. Here, we introduce the notion of typesafe coordinates: coordinate intervals are not only an integer pair, but members of a type class comprising four types: the Cartesian product of a zero or one basis, and an open or closed interval end. By leveraging the type system of statically and strongly-typed, compiled languages we can provide static guarantees that an entire class of error is eliminated. We provide a reference implementation in D as part of a larger work (dhtslib), and proofs of concept in Rust, OCaml, and Python. Exploratory implementations are available at…
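The four-type idea in the abstract can be sketched in a few lines. What follows is a minimal illustration in Python, not the API of dhtslib or of the authors' proofs of concept: the names `Interval`, `ZBHO`, `ZBC`, `OBHO`, `OBC`, `to`, `length`, and `overlaps` are all hypothetical. Python does not enforce annotations at run time, so the "static" guarantee here relies on an external type checker such as mypy.

```python
from dataclasses import dataclass
from typing import ClassVar, Type, TypeVar

T = TypeVar("T", bound="Interval")


@dataclass(frozen=True)
class Interval:
    """Hypothetical base class: an integer pair plus a coordinate convention."""
    start: int
    end: int
    BASIS: ClassVar[int] = 0        # 0 = zero-based, 1 = one-based
    CLOSED: ClassVar[bool] = False  # True = end is inclusive

    def to(self, target: Type[T]) -> T:
        """Convert explicitly to another coordinate system."""
        # Normalize to zero-based, half-open first.
        start0 = self.start - self.BASIS
        end0 = self.end - self.BASIS + (1 if self.CLOSED else 0)
        # Re-express in the target convention.
        return target(start0 + target.BASIS,
                      end0 + target.BASIS - (1 if target.CLOSED else 0))

    def length(self) -> int:
        """Number of positions covered, independent of convention."""
        return self.end - self.start + (1 if self.CLOSED else 0)


# The Cartesian product {zero, one} x {half-open, closed}.
class ZBHO(Interval):  # zero-based, half-open (the BED convention)
    BASIS, CLOSED = 0, False

class ZBC(Interval):   # zero-based, closed
    BASIS, CLOSED = 0, True

class OBHO(Interval):  # one-based, half-open
    BASIS, CLOSED = 1, False

class OBC(Interval):   # one-based, closed (the GFF convention)
    BASIS, CLOSED = 1, True


def overlaps(a: ZBHO, b: ZBHO) -> bool:
    """Overlap test written once, for a single convention.

    A type checker rejects overlaps(bed_iv, gff_iv) because OBC is not a
    ZBHO; the caller must convert explicitly with gff_iv.to(ZBHO).
    """
    return a.start < b.end and b.start < a.end


bed_iv = ZBHO(0, 10)    # first ten bases, BED-style
gff_iv = OBC(10, 20)    # bases 10 through 20, GFF-style
print(bed_iv.to(OBC))                       # same ten bases, one-based closed
print(overlaps(bed_iv, gff_iv.to(ZBHO)))    # base 10 is shared
```

Because the four classes are siblings rather than subtypes of one another, an interval in one convention never silently passes for another: `ZBHO(0, 10)` and `OBC(0, 10)` compare unequal, and conversion happens only through the explicit `to` call. In a compiled language the same check would happen at compile time rather than in a separate type-checking step.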
Taxonomy
Topics: Natural Language Processing Techniques · Genomics and Phylogenetic Studies · Semantic Web and Ontologies
