Insights into the Unknown: Federated Data Diversity Analysis on Molecular Data

Markus Bujotzek; Evelyn Trautmann; Calum Hand; Ian Hales

arXiv:2510.19535·cs.LG·May 7, 2026

TL;DR

This paper evaluates federated clustering methods for analyzing the diversity of distributed molecular data in pharmaceutical research, emphasizing the importance of domain knowledge and explainability.

Contribution

It benchmarks federated clustering approaches against centralized methods on molecular datasets and introduces chemistry-informed evaluation metrics.

Findings

1. Federated clustering methods can effectively analyze distributed molecular data.

2. Incorporating domain knowledge improves diversity assessment accuracy.

3. Explainability analyses highlight the importance of chemistry-informed metrics.

Abstract

AI methods are increasingly shaping pharmaceutical drug discovery. However, their translation to industrial applications remains limited because they rely on public datasets, which lack the scale and diversity of proprietary pharmaceutical data. Federated learning (FL) offers a promising approach to integrate private data into privacy-preserving, collaborative model training across data silos. This federated data access, however, complicates important data-centric tasks such as estimating dataset diversity, performing informed data splits, and understanding the structure of the combined chemical space. To address this gap, we investigate how well federated clustering methods can disentangle and represent distributed molecular data. We benchmark three approaches, Federated kMeans (Fed-kMeans), Federated Principal Component Analysis combined with Fed-kMeans (Fed-PCA+Fed-kMeans), and Federated…
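To make the idea of federated k-means concrete, the following is a minimal sketch of one common formulation, not the authors' implementation. It assumes each client holds numeric feature vectors (for molecules these might be fingerprints or embeddings, which is an assumption here) and shares only per-cluster sums and counts with a server, never the raw data. The names `local_update` and `fed_kmeans` and the toy data are hypothetical.

```python
import numpy as np


def local_update(X, centroids):
    """Client side: assign local points to the nearest global centroid
    and return only per-cluster sums and counts (no raw data)."""
    # Squared Euclidean distance from every point to every centroid.
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    k, dim = centroids.shape
    sums = np.zeros((k, dim))
    counts = np.zeros(k, dtype=int)
    np.add.at(sums, labels, X)    # accumulate points per cluster
    np.add.at(counts, labels, 1)  # count points per cluster
    return sums, counts


def fed_kmeans(client_data, init_centroids, rounds=10):
    """Server side: aggregate client statistics and update centroids."""
    centroids = np.asarray(init_centroids, dtype=float).copy()
    for _ in range(rounds):
        total_sums = np.zeros_like(centroids)
        total_counts = np.zeros(len(centroids), dtype=int)
        for X in client_data:
            sums, counts = local_update(X, centroids)
            total_sums += sums
            total_counts += counts
        # Keep the previous centroid for clusters that received no points.
        nonempty = total_counts > 0
        centroids[nonempty] = total_sums[nonempty] / total_counts[nonempty, None]
    return centroids


# Toy example (hypothetical data): two clients, two well-separated groups.
client_a = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0]])
client_b = np.array([[2.0, 0.0], [2.0, 2.0], [10.0, 12.0], [12.0, 10.0], [12.0, 12.0]])
centroids = fed_kmeans([client_a, client_b], init_centroids=[[0.0, 0.0], [12.0, 12.0]])
print(centroids)
```

Because the server sums the client statistics before dividing, the resulting centroids match what centralized k-means (Lloyd's algorithm) would compute from the pooled data with the same initialization. This sketch omits secure aggregation, initialization strategy, and any dimensionality reduction step.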
