# MixingDTA: improved drug–target affinity prediction by extending mixup with guilt-by-association

**Authors:** Youngoh Kim, Dongmin Bang, Bonil Koo, Jungseob Yi, Changyun Cho, Jeonguk Choi, Sun Kim

PMC · DOI: 10.1093/bioinformatics/btaf238 · 2025-07-15

## TL;DR

This paper introduces MixingDTA, a new framework that improves drug–target affinity predictions using data augmentation and pre-trained models, helping overcome data scarcity in drug discovery.

## Contribution

MixingDTA introduces a novel data augmentation strategy, GBA-Mixup, and combines it with MEETA, a backbone model built on pretrained molecule and protein language models (MolFormer and ESM), to improve DTA prediction accuracy.

## Key findings

- The MEETA backbone alone improves mean squared error by up to 19% over the current state-of-the-art baseline.
- Adding GBA-Mixup contributes a further 8.4% improvement; the augmentation is model-agnostic, with gains of up to 16.9% across the tested backbone models.
- In case studies, MixingDTA generalizes to unseen drug–target pairs and focuses on functionally critical residues.
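The percentages above are most naturally read as relative reductions in mean squared error, although this summary does not spell out the formula. The following is a minimal sketch under that assumption; the function name `relative_mse_improvement` and the MSE values are hypothetical and are not taken from the paper.

```python
def relative_mse_improvement(baseline_mse: float, model_mse: float) -> float:
    """Percent reduction in MSE relative to a baseline (assumed definition)."""
    if baseline_mse <= 0:
        raise ValueError("baseline_mse must be positive")
    return 100.0 * (baseline_mse - model_mse) / baseline_mse


# Made-up numbers for illustration only, not results from the paper.
improvement = relative_mse_improvement(baseline_mse=0.200, model_mse=0.162)
print(f"{improvement:.1f}% lower MSE than the baseline")
```

Because lower MSE is better, a positive value means the model beats the baseline, and a negative value means it is worse.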

## Abstract

Drug–target affinity (DTA) prediction is an important regression task for drug discovery that provides richer information than traditional drug–target interaction prediction, which is a binary prediction task. Accurate DTA prediction requires a large amount of data for each drug, which is not currently available. Thus, data scarcity and sparsity are major challenges. Another important task is “cold-start” DTA prediction for unseen drugs or proteins. In this work, we introduce MixingDTA, a novel framework that tackles data scarcity by incorporating domain-specific pretrained language models for molecules and proteins through our MEETA (MolFormer and ESM-based Efficient aggregation Transformer for Affinity) model. We further address the label sparsity and cold-start challenges through a novel data augmentation strategy named GBA-Mixup, which interpolates embeddings of neighboring entities based on the guilt-by-association (GBA) principle to improve prediction accuracy even in sparse regions of DTA space. Our experiments on benchmark datasets demonstrate that the MEETA backbone alone provides up to a 19% improvement in mean squared error over the current state-of-the-art baseline, and the addition of GBA-Mixup contributes a further 8.4% improvement. Importantly, GBA-Mixup is model-agnostic, delivering performance gains of up to 16.9% across all tested backbone models. Case studies show how MixingDTA interpolates between drugs and targets in the embedding space, demonstrating generalizability to unseen drug–target pairs while effectively focusing on functionally critical residues. These results highlight MixingDTA’s potential to accelerate drug discovery by offering accurate, scalable, and biologically informed DTA predictions.
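The idea of interpolating embeddings of neighboring entities can be illustrated with a generic mixup step restricted to a nearest neighbor. This is only a sketch and not the authors' implementation. It assumes cosine similarity to pick the neighbor, a Beta(α, α) mixing coefficient as in standard mixup, and linear interpolation of the affinity labels; none of these details are confirmed by this summary. All names (`cosine`, `nearest_neighbor`, `mix_with_neighbor`) and the toy data are hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest_neighbor(index, embeddings):
    """Index of the embedding most similar to embeddings[index], excluding itself."""
    anchor = embeddings[index]
    candidates = [j for j in range(len(embeddings)) if j != index]
    return max(candidates, key=lambda j: cosine(anchor, embeddings[j]))


def mix_with_neighbor(index, embeddings, labels, rng, alpha=0.5):
    """Interpolate one sample with its nearest neighbor (assumed mixup form)."""
    j = nearest_neighbor(index, embeddings)
    lam = rng.betavariate(alpha, alpha)  # mixing coefficient in [0, 1]
    mixed_x = [lam * a + (1.0 - lam) * b
               for a, b in zip(embeddings[index], embeddings[j])]
    mixed_y = lam * labels[index] + (1.0 - lam) * labels[j]
    return mixed_x, mixed_y, j, lam


# Toy drug embeddings and affinities against one hypothetical target.
drug_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
affinities = [7.0, 6.5, 4.0]

rng = random.Random(0)
mixed_x, mixed_y, neighbor, lam = mix_with_neighbor(0, drug_embeddings, affinities, rng)
print(neighbor, round(lam, 3), mixed_x, round(mixed_y, 3))
```

Restricting the mixing partner to a neighbor, rather than a random sample as in plain mixup, reflects the guilt-by-association intuition that similar entities tend to have similar affinities, so the interpolated label is more likely to be plausible.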

The code for MixingDTA is available at https://github.com/rokieplayer20/MixingDTA.

## Full-text entities

- **Genes:** DDR1 (discoidin domain receptor tyrosine kinase 1) [NCBI Gene 780] {aka CAK, CD167, DDR, EDDR1, HGK2, MCK10}
- **Chemicals:** imatinib (MESH:D000068877), hydrogen (MESH:D006859), amino acid (MESH:D000596)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12261493/full.md

---
Source: https://tomesphere.com/paper/PMC12261493