# Fair molecular feature selection unveils universally tumor lineage-informative methylation sites in colorectal cancer

**Authors:** Xuan Cindy Li, Yuelin Liu, Alejandro A Schäffer, Stephen M Mount, S Cenk Sahinalp

PMC · DOI: 10.1093/bioinformatics/btaf237 · 2025-07-15

## TL;DR

This paper introduces FALAFL, a combinatorial optimization method for selecting molecular features fairly across patients in multi-sample sequencing data, and uses it to identify methylation sites that are tumor lineage-informative across diverse colorectal cancer patients.

## Contribution

FALAFL is a novel combinatorial optimization algorithm for fair feature selection in multi-sample sequencing data.

## Key findings

- FALAFL identifies CpG sites that are each well covered by cells in the vast majority of patients, while ensuring that a large proportion of the selected sites have high read coverage in every patient.
- Selected sites show strong tumor lineage-informativeness across diverse patient profiles.
- Universally informative sites are enriched in inter-CpG island regions.

## Abstract

In the era of precision medicine, performing comparative analysis over diverse patient populations is a fundamental step toward tailoring healthcare interventions. However, the aspect of fairly selecting molecular features across multiple patients is often overlooked.

To address this challenge, we introduce FALAFL (FAir muLti-sAmple Feature seLection), an algorithmic approach based on combinatorial optimization. FALAFL is designed to perform feature selection in sequencing data in a way that ensures a balanced selection of features from all patient samples in a cohort. We have applied FALAFL to the problem of selecting lineage-informative CpG sites within a cohort of colorectal cancer patients subjected to low-coverage single-cell methylation sequencing. Our results demonstrate that FALAFL can rapidly and robustly determine the optimal set of CpG sites, which are each well covered by cells across the vast majority of the patients, while ensuring that in each patient, a large proportion of these sites have high read coverage. An analysis of the FALAFL-selected sites reveals that their tumor lineage-informativeness exhibits a strong correlation across a spectrum of diverse patient profiles. Furthermore, these universally lineage-informative sites are highly enriched in the inter-CpG island regions. We hope that FALAFL will aid in designing panels for diagnostic and prognostic purposes and help propel fair data science practices in the exploration of complex diseases.

The source code is available at: https://github.com/algo-cancer/FALAFL.
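
The two-sided coverage requirement described in the abstract (each selected site well covered in most patients, and each patient well covered on most selected sites) can be illustrated with a small sketch. This is not the authors' implementation and not the combinatorial optimization that FALAFL solves; it is a simple greedy heuristic written only to make the two criteria concrete. The function name `select_fair_sites`, the thresholds `min_patient_frac` and `min_site_frac`, and the toy matrix `demo` are all hypothetical.

```python
from typing import List, Sequence


def select_fair_sites(
    covered: Sequence[Sequence[int]],
    min_patient_frac: float = 0.8,
    min_site_frac: float = 0.8,
) -> List[int]:
    """Greedy illustration of fair site selection (NOT the FALAFL algorithm).

    covered[p][s] is truthy when site s is well covered in patient p.
    Returns indices of sites such that (1) every site is covered in at least
    min_patient_frac of the patients and (2) every patient has at least
    min_site_frac of the returned sites covered.
    """
    if not (0.0 <= min_patient_frac <= 1.0 and 0.0 <= min_site_frac <= 1.0):
        raise ValueError("thresholds must lie in [0, 1]")

    n_patients = len(covered)
    n_sites = len(covered[0]) if n_patients else 0

    # Number of patients in which each site is well covered.
    support = [
        sum(1 for p in range(n_patients) if covered[p][s]) for s in range(n_sites)
    ]

    # Site-side criterion: keep sites covered in enough patients.
    selected = [s for s in range(n_sites) if support[s] >= min_patient_frac * n_patients]

    # Patient-side criterion: while some patient is under-served, drop one
    # site that this patient lacks, preferring the site with least support.
    while selected:
        fractions = [
            sum(1 for s in selected if covered[p][s]) / len(selected)
            for p in range(n_patients)
        ]
        worst = min(range(n_patients), key=fractions.__getitem__)
        if fractions[worst] >= min_site_frac:
            break
        missing = [s for s in selected if not covered[worst][s]]
        selected.remove(min(missing, key=lambda s: support[s]))
    return selected


# Toy cohort (assumed data): 4 patients (rows) by 6 CpG sites (columns).
demo = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 0, 1, 1, 0],
]
print(select_fair_sites(demo, min_patient_frac=0.75, min_site_frac=0.7))  # prints [0, 1, 2, 4]
```

In the toy cohort, site 5 is removed because only one patient covers it, and site 3 is then removed so that the first patient reaches the per-patient threshold. A greedy pass like this carries no optimality guarantee, which is one reason an exact combinatorial formulation is attractive for this problem.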

## Linked entities

- **Diseases:** colorectal cancer (MONDO:0005575)

## Full-text entities

- **Diseases:** tumor (MESH:D009369), colorectal cancer (MESH:D015179)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12261425/full.md

---
Source: https://tomesphere.com/paper/PMC12261425