# HABiC: an algorithm based on the exact computation of the Kantorovich-Rubinstein optimizer for binary classification in transcriptomics

**Authors:** Chiara Cordier, Pascal Jézéquel, Mario Campone, Fabien Panloup, Agnes Basseville

PMC · DOI: 10.1093/bioinformatics/btaf310 · 2025-05-19

## TL;DR

This paper introduces HABiC, a binary classifier for transcriptomics data. It uses the 1-Wasserstein distance to capture complex relationships between variables, and its decision rule is built on the exact computation of the Kantorovich-Rubinstein optimizer to increase precision.

## Contribution

The main contribution is a binary classification algorithm for transcriptomics data whose decision rule relies on the exact computation of the Kantorovich-Rubinstein optimizer, rather than on a neural network-based approximation of it.
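
For readers unfamiliar with the term, the Kantorovich-Rubinstein optimizer is the function that attains the supremum in the standard dual form of the 1-Wasserstein distance. The identity below is the textbook statement; the symbols $\mu$ and $\nu$ (the two class-conditional distributions) and $f$ (the 1-Lipschitz potential) are notation assumed here, not taken from the paper.

$$
W_1(\mu, \nu)
= \inf_{\pi \in \Pi(\mu, \nu)} \int \lVert x - y \rVert \, d\pi(x, y)
= \sup_{\lVert f \rVert_{\mathrm{Lip}} \le 1} \left( \int f \, d\mu - \int f \, d\nu \right)
$$

A maximizing $f$ takes high values where $\mu$ concentrates and low values where $\nu$ concentrates, which is why it can serve as a classification score.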

## Key findings

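- On synthetic datasets with multiple scenarios of redundant/informative variables, both exact and approximate Wasserstein-based methods outperformed state-of-the-art algorithms when class information was spread across a large number of variables.
- When predicting clinical or biological outcomes from transcriptomics datasets, HABiC achieved consistently higher accuracy in most situations.
- The exact method was benchmarked against a neural network-based approximate method and standard Euclidean distance-based classifiers, and dimension reduction and aggregation methods were explored to improve its robustness.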

## Abstract

Machine learning analyses of molecular omics datasets largely drive the development of precision medicine in oncology, but mathematical challenges still hamper their application in the clinic. In particular, omics-based learning relies on high dimensional data with high degrees of freedom and multicollinearity issues, requiring more tailored algorithms. Here, we have developed a prediction algorithm that relies on the 1-Wasserstein distance to better capture complex relationships between variables, and that is built on a decision rule based on the exact computation of the Kantorovich-Rubinstein optimizer to increase the algorithm precision. We explored dimension reduction and aggregation methods to improve its robustness. The exact method was compared with a neural network-based approximate method, as well as with standard Euclidean distance-based classifiers.
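
The idea of a decision rule built on an exactly computed Kantorovich-Rubinstein potential can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation (their code is in the repository linked in the abstract): the function names `fit_kr_potential`, `kr_score` and `predict` are hypothetical, the potential is obtained by solving the discrete dual as a linear program with SciPy, it is extended to unseen points by averaging the two McShane-Whitney 1-Lipschitz extensions, and the midpoint threshold is an assumed choice.

```python
import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist


def fit_kr_potential(X0, X1):
    """Solve the discrete Kantorovich-Rubinstein dual on the training points.

    Returns the stacked points Z, the potential values f on Z, and the
    1-Wasserstein distance between the two empirical distributions.
    """
    Z = np.vstack([X0, X1])
    n0, n1 = len(X0), len(X1)
    n = n0 + n1
    D = cdist(Z, Z)  # pairwise Euclidean distances

    # Maximise mean(f on class 1) - mean(f on class 0).
    # linprog minimises, so the objective is negated.
    c = np.concatenate([np.full(n0, 1.0 / n0), np.full(n1, -1.0 / n1)])

    # 1-Lipschitz constraints: f_i - f_j <= D_ij for every ordered pair i != j.
    # This builds O(n^2) rows, so it only suits small illustrative samples.
    i, j = np.where(~np.eye(n, dtype=bool))
    A_ub = np.zeros((len(i), n))
    A_ub[np.arange(len(i)), i] = 1.0
    A_ub[np.arange(len(i)), j] = -1.0
    b_ub = D[i, j]

    # f is only defined up to an additive constant, so pin f[0] = 0.
    A_eq = np.zeros((1, n))
    A_eq[0, 0] = 1.0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[0.0],
                  bounds=(None, None))
    return Z, res.x, -res.fun


def kr_score(Z, f, X_new):
    """Extend f to new points while keeping it 1-Lipschitz."""
    d = cdist(X_new, Z)
    upper = np.min(f[None, :] + d, axis=1)  # largest 1-Lipschitz extension
    lower = np.max(f[None, :] - d, axis=1)  # smallest 1-Lipschitz extension
    return 0.5 * (upper + lower)


def predict(Z, f, n0, X_new):
    """Assumed rule: class 1 if the score exceeds the midpoint of class means."""
    threshold = 0.5 * (f[:n0].mean() + f[n0:].mean())
    return (kr_score(Z, f, X_new) > threshold).astype(int)


# Toy usage on two Gaussian clouds (made-up data, for illustration only).
rng = np.random.default_rng(0)
train0 = rng.normal(0.0, 1.0, size=(15, 5))
train1 = rng.normal(2.0, 1.0, size=(15, 5))
Z, f, w1 = fit_kr_potential(train0, train1)
labels = predict(Z, f, len(train0), rng.normal(2.0, 1.0, size=(5, 5)))
```

The linear program has one variable per training sample and one constraint per ordered pair of samples, so this sketch scales quadratically in the number of samples.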

Experimental results on synthetic datasets with multiple scenarios of redundant/informative variables revealed that exact and approximate methods based on Wasserstein distance outperformed state-of-the-art algorithms when class information was spread across a large number of variables. When predicting clinical or biological outcomes from transcriptomics datasets, HABiC achieved consistently higher accuracy in most situations.
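
To make the phrase "class information spread across a large number of variables" concrete, the sketch below generates a toy dataset with informative, redundant and pure-noise variables. It is a hypothetical generator written for illustration: the function name, the default sizes and the mean-shift design are assumptions and do not reproduce the paper's simulation scenarios.

```python
import numpy as np


def make_spread_signal_data(n_per_class=50, n_informative=20, n_redundant=10,
                            n_noise=70, shift=0.3, seed=0):
    """Toy two-class data where each informative variable carries a weak signal."""
    rng = np.random.default_rng(seed)
    y = np.repeat([0, 1], n_per_class)
    n = 2 * n_per_class

    # Informative variables: a small class-dependent mean shift on each one,
    # so no single variable separates the classes well on its own.
    informative = rng.normal(size=(n, n_informative)) + shift * y[:, None]

    # Redundant variables: linear combinations of the informative ones,
    # which introduces multicollinearity.
    mix = rng.normal(size=(n_informative, n_redundant))
    redundant = informative @ mix

    # Noise variables: independent of the class label.
    noise = rng.normal(size=(n, n_noise))

    X = np.hstack([informative, redundant, noise])
    return X, y


X, y = make_spread_signal_data()
```

Lowering `shift` while raising `n_informative` spreads the same overall signal more thinly across variables.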

Python code for the HABiC classifier is available at https://github.com/chiaraco/HABiC.

## Full-text entities

- **Diseases:** DL (MESH:D007859), Lung cancer (MESH:D008175), PVL (MESH:D054973), Breast cancer (MESH:D001943), cancer (MESH:D009369)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12198494/full.md

---
Source: https://tomesphere.com/paper/PMC12198494