IBEX: Information-Bottleneck-EXplored Coarse-to-Fine Molecular Generation under Limited Data

Dong Xu; Zhangfan Yang; Jenna Xinyi Yao; Shuangbao Song; Zexuan Zhu; Junkai Ji

arXiv:2508.10775·cs.LG·August 15, 2025

TL;DR

IBEX is a novel coarse-to-fine molecular generation pipeline that leverages information-bottleneck theory to improve transferability and success rates in structure-based drug discovery under limited data conditions.

Contribution

It introduces an information-theoretic analysis to guide masking strategies and enhances a target architecture with a physics-based refinement step, significantly improving docking success and molecule quality.

Findings

01

Increases zero-shot docking success rate from 53% to 64%.

02

Improves mean Vina score from -7.41 to -8.07 kcal/mol.

03

Achieves state-of-the-art validity and diversity in generated molecules.
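To put the two numeric findings on a common footing, the short sketch below computes the absolute and relative change for each. The helper name `summarize_gain` is hypothetical and only the four reported values come from the summary above; nothing here reproduces the paper's evaluation.

```python
def summarize_gain(before: float, after: float) -> dict:
    """Return the absolute and relative change between two reported values."""
    delta = after - before
    return {"absolute": delta, "relative": delta / abs(before)}


# Values taken from the Findings above.
success = summarize_gain(0.53, 0.64)  # zero-shot docking success rate
vina = summarize_gain(-7.41, -8.07)   # mean Vina score, kcal/mol (lower is better)

print(f"Docking success: +{success['absolute'] * 100:.0f} points "
      f"({success['relative']:.1%} relative)")
print(f"Mean Vina score: {vina['absolute']:.2f} kcal/mol "
      f"({abs(vina['relative']):.1%} lower)")
```

Under these inputs, the success rate rises by 11 percentage points (about 20.8% relative), and the mean Vina score drops by 0.66 kcal/mol (about 8.9%).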

Abstract

Three-dimensional generative models increasingly drive structure-based drug discovery, yet they remain constrained by the scarcity of publicly available protein-ligand complexes. Under such data scarcity, almost all existing pipelines struggle to learn transferable geometric priors and consequently overfit to training-set biases. We therefore present IBEX, an Information-Bottleneck-EXplored coarse-to-fine pipeline that tackles the chronic shortage of protein-ligand complex data in structure-based drug design. Specifically, we use PAC-Bayesian information-bottleneck theory to quantify the information density of each sample. This analysis reveals how different masking strategies affect generalization and indicates that, compared with conventional de novo generation, the constrained Scaffold Hopping task endows the model with greater effective capacity and improved transfer performance. IBEX retains…
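The masking strategies mentioned in the abstract can be pictured as per-atom masks over a ligand. The sketch below is only an illustration under stated assumptions: the function `build_mask`, the task labels, and the precomputed `scaffold_atoms` index set (for example from a Bemis-Murcko decomposition) are hypothetical, and the reading of SC as "side chains masked, scaffold given" is an assumption, since the summary does not define it. It is not the authors' implementation.

```python
from typing import List, Set


def build_mask(num_atoms: int, scaffold_atoms: Set[int], task: str) -> List[bool]:
    """Return a per-atom mask; True means the atom is hidden and must be generated."""
    if task == "DN":
        # De novo generation: every ligand atom is masked.
        return [True] * num_atoms
    if task == "SH":
        # Scaffold hopping: the scaffold is masked, side chains stay as context.
        return [i in scaffold_atoms for i in range(num_atoms)]
    if task == "SC":
        # ASSUMPTION: side chains are masked, the scaffold stays as context.
        return [i not in scaffold_atoms for i in range(num_atoms)]
    raise ValueError(f"unknown task: {task!r}")


# Toy ligand with 6 atoms; atoms 0-3 are taken to form the scaffold.
scaffold = {0, 1, 2, 3}
masks = {task: build_mask(6, scaffold, task) for task in ("DN", "SH", "SC")}
```

In this picture, training on the SH mask and then sampling with the DN mask corresponds to the train/inference split that the reviews below describe.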

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 2

Strengths

1. Innovative Information Bottleneck Approach: The paper introduces a novel application of the PAC-Bayes information bottleneck framework to molecular generation tasks, providing a theoretical basis for understanding the information density of different generation paradigms. 2. The proposed framework fully utilizes prior chemical information and achieves good results in terms of docking success rates, binding energies and so on, demonstrating its effectiveness in generating high-quality molecul

Weaknesses

1. The decoupled approach and the use of multiple modules (e.g., scaffold hopping, physical refinement) may increase the complexity of the overall framework, potentially making it harder to implement and fine-tune compared to more integrated methods. 2. The method appears to be a mere adjustment of the training approach based on TargetDiff. It would be more convincing if its performance could be validated across other diffusion architectures as well (e.g., MolCRAFT or others), which might better s

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The theoretical analysis of the difficulties of each problem setting is a very interesting approach, and helps give some much-needed direction to the field of pocket-conditioned molecular generation. I think the metrics that the authors use make sense given the setup, especially looking at the gradient-SNR. 2. The proposed splitting of the training and inference tasks is interesting, and seems to perform quite well. 3. The generated molecules look reasonable and have relatively strong Vina sc

Weaknesses

In general, I think the weaknesses of this paper come from the definitions that were used to define the three tasks. The de novo task is realistic, but the SC and SH tasks are not very realistic in my opinion. First, SC would be somewhat close to a lead optimization task, but usually in lead optimization you already have side chains and you want to modify or expand them to increase activity. Second, SH is not realistic because scaffold hopping involves searching for both a new scaffold and new

Reviewer 03 · Rating 2 · Confidence 5

Strengths

1. Shows, under a PAC-Bayes information-bottleneck view, that SH is the most information-dense task, with the highest means across interaction/geometry axes and the broadest distributions (i.e., more informative training signals). SH exposes a richer, more discriminative signal (higher information density) than DN/SC, tightening generalization bounds. Training on SH therefore structures the latent space so that it transfers better when fully masking at test time (DN).

Weaknesses

1. Lacks references to BM-scaffold hopping generative models and recent de novo generative models, as well as performance comparisons with DiffHopp: A Graph Diffusion Model for Novel Drug Design via Scaffold Hopping (ICML 2023 Workshop) and TurboHopp: Accelerated Molecule Scaffold Hopping with Consistency Models (NeurIPS 2024). Why not implement these models for the SH stage? 2. Figures are hard to understand. Typos (MOLCARFT -> MOLCRAFT) and the paper citation formats need improveme
