# A hybrid unsupervised methodology on artificial intelligence filtering for automatically processing cellular DNA-encoded library (DEL) datasets

**Authors:** Yiran Huang, Xiao Tan, Xiaoyu Li, Feng Xiong, Siu Ming Yiu

PMC · DOI: 10.1093/bioinformatics/btag001 · 2026-01-07

## TL;DR

A hybrid unsupervised workflow automatically identifies hit compounds in noisy cell-based DNA-encoded library (DEL) selection data, supporting drug discovery.

## Contribution

A hybrid unsupervised AI methodology is introduced for efficient and accurate hit identification in noisy cell-based DEL datasets.

## Key findings

- The automated workflow shows high consistency with experimental results across different library sizes.
- The method, developed on INSR selections, also generalizes to a second target protein, TPOR.
- Pre-trained models and datasets are publicly available for further use and validation.

## Abstract

DNA-encoded library (DEL) technology has been developed as a powerful platform for drug development. Live cell-based selection methodologies were recently developed to expedite drug candidate discovery with higher biological relevance. Nevertheless, hit characterization is challenged by the prominent background signals of cell-based selections. Therefore, an automated data-processing pipeline compatible with noisy sequencing output is highly desirable.

Herein, we report an automatic method that identifies the most promising hits from large quantities of cell-based DEL data with improved accuracy and efficiency. The processing workflow is based on a comprehensive unsupervised algorithm incorporating data pre-processing, feature extraction and outlier filtering, descriptor-based classification, similarity score ranking, and active compound prediction. We developed the methodology with two DEL selection datasets targeting insulin receptor (INSR) on live cells, from ∼30 million- and 1.033 billion-membered libraries. The automated scheme demonstrated high consistency with experimental results as well as self-adaptivity to on-cell DEL datasets of varied library scales. Extending the methodology to cellular thrombopoietin receptor (TPOR) further substantiated the algorithm's ability to generalize across target proteins. Thus, this approach can serve as a widely applicable workflow that automatically differentiates hit compounds and thereby facilitates drug development from candidate discovery onward.
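The abstract names the workflow stages but does not specify the algorithms behind them. As a rough illustration only, and not the authors' implementation, the sketch below shows one generic way two of those stages could look on toy data: an enrichment score from sequencing counts (pre-processing) followed by unsupervised outlier filtering with a median/MAD modified z-score, plus a Tanimoto coefficient of the kind often used for similarity ranking. Every name here (`enrichment`, `robust_outliers`, `tanimoto`), the pseudocount, the 3.5 threshold, and the read counts are assumptions made up for this example.

```python
from statistics import median


def enrichment(target_counts, control_counts, pseudocount=1.0):
    """Depth-normalized target frequency divided by control frequency.

    Hypothetical helper: the pseudocount avoids division by zero for
    compounds that were never sequenced in the control selection.
    """
    t_total = sum(target_counts.values())
    c_total = sum(control_counts.values())
    scores = {}
    for cid, t in target_counts.items():
        c = control_counts.get(cid, 0)
        scores[cid] = ((t + pseudocount) / t_total) / ((c + pseudocount) / c_total)
    return scores


def robust_outliers(scores, threshold=3.5):
    """Return compound ids whose modified z-score exceeds the threshold.

    Uses the median and the median absolute deviation (MAD), which are
    less distorted by a noisy background than the mean and standard deviation.
    Only high-side outliers (enriched compounds) are returned.
    """
    values = list(scores.values())
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []
    return [cid for cid, v in scores.items()
            if 0.6745 * (v - med) / mad > threshold]


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two feature sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


# Toy read counts (invented for illustration, not data from the paper).
target = {"A": 120, "B": 3, "C": 4, "D": 2, "E": 5, "F": 95}
control = {"A": 2, "B": 3, "C": 5, "D": 2, "E": 4, "F": 3}

scores = enrichment(target, control)
hits = robust_outliers(scores)
print(hits)
```

On this toy input the two strongly enriched compounds, `A` and `F`, are flagged while the background compounds are not. A real pipeline would operate on full sequencing output and chemical descriptors; the Zenodo archives linked in this summary are the place to look for the authors' actual code and pre-trained models.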

The complete datasets, source code, and pre-trained models are made available at https://doi.org/10.5281/zenodo.17452392 and https://doi.org/10.5281/zenodo.17569557.

## Linked entities

- **Proteins:** INSR (insulin receptor), MPL (MPL proto-oncogene, thrombopoietin receptor)

## Full-text entities

- **Genes:** MPL (MPL proto-oncogene, thrombopoietin receptor) [NCBI Gene 4352] {aka C-MPL, CD110, MPLV, THCYT2, THPOR, TPOR}, INSR (insulin receptor) [NCBI Gene 3643] {aka CD220, HHF5}

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12836421/full.md

---
Source: https://tomesphere.com/paper/PMC12836421