PMKLC: Parallel Multi-Knowledge Learning-based Lossless Compression for Large-Scale Genomics Database

Hui Sun; Yanfeng Ding; Liping Yi; Huidong Ma; Gang Wang; Xiaoguang Liu; Cheng Zhong; Wentong Cai

arXiv:2507.12805·cs.LG·July 18, 2025

Open Access

TL;DR

PMKLC is a parallel, multi-knowledge learning-based lossless compressor for large-scale genomic data. It reports up to 73.6% better compression ratio and up to 10.7 times higher throughput, along with improved robustness.

Contribution

The paper presents a novel GPU-accelerated, multi-knowledge learning framework with parallel mechanisms and adaptable modes for efficient, robust lossless genomic data compression.

Findings

01

Achieves up to 73.6% better compression ratio.

02

Up to 10.7 times higher throughput.

03

Demonstrates superior robustness and resource efficiency.

Abstract

Learning-based lossless compressors play a crucial role in large-scale genomic database backup, storage, transmission, and management. However, their 1) inadequate compression ratio, 2) low compression & decompression throughput, and 3) poor compression robustness limit their widespread adoption and application in both industry and academia. To solve those challenges, we propose a novel Parallel Multi-Knowledge Learning-based Compressor (PMKLC) with four crucial designs: 1) We propose an automated multi-knowledge learning-based compression framework as compressors' backbone to enhance compression ratio and robustness; 2) we design a GPU-accelerated (s, k)-mer encoder to optimize compression throughput and computing resource usage; 3) we introduce data block partitioning and Step-wise Model Passing (SMP) mechanisms for…
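
The abstract names an (s, k)-mer encoder but does not define it here. As a rough illustration only, the sketch below assumes it means tokenizing a nucleotide sequence into windows of length k taken every s bases, with each window mapped to an integer in base 4. The function name `sk_mer_encode`, the A/C/G/T alphabet, and the CPU-only loop are all assumptions made for this example; it is not the paper's GPU implementation.

```python
from typing import List

# Assumed 4-letter alphabet; real genomic data may contain other symbols (e.g. N).
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def sk_mer_encode(seq: str, s: int, k: int) -> List[int]:
    """Map each length-k window, taken every s bases, to an integer in [0, 4**k).

    Hypothetical helper: this is one plausible reading of "(s, k)-mer",
    not the definition used by PMKLC.
    """
    if s <= 0 or k <= 0:
        raise ValueError("s and k must be positive")
    tokens = []
    # Slide a window of length k across the sequence with stride s.
    for start in range(0, len(seq) - k + 1, s):
        value = 0
        for base in seq[start:start + k]:
            # Interpret the window as a base-4 number, most significant base first.
            value = value * 4 + BASE_TO_CODE[base]
        tokens.append(value)
    return tokens


if __name__ == "__main__":
    print(sk_mer_encode("ACGTAC", s=2, k=2))
```

With s = k the windows do not overlap and each base is covered once; with s < k they overlap. In this sketch any trailing bases that do not fill a final window are dropped, so a lossless pipeline would have to store them separately.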

Taxonomy

Topics: Gene expression and cancer classification · Algorithms and Data Compression