Insights into the Unknown: Federated Data Diversity Analysis on Molecular Data
Markus Bujotzek, Evelyn Trautmann, Calum Hand, Ian Hales

TL;DR
This paper evaluates federated clustering methods for analyzing the diversity of distributed molecular data in pharmaceutical research, emphasizing the importance of domain knowledge and explainability.
Contribution
It benchmarks federated clustering approaches against centralized methods on molecular datasets and introduces chemistry-informed evaluation metrics.
Findings
Federated clustering methods can effectively analyze distributed molecular data.
Incorporating domain knowledge improves diversity assessment accuracy.
Explainability analyses highlight the importance of chemistry-informed metrics.
Abstract
AI methods are increasingly shaping pharmaceutical drug discovery. However, their translation to industrial applications remains limited because they rely on public datasets, which lack the scale and diversity of proprietary pharmaceutical data. Federated learning (FL) offers a promising approach to integrate private data into privacy-preserving, collaborative model training across data silos. However, this federated data access complicates important data-centric tasks such as estimating dataset diversity, performing informed data splits, and understanding the structure of the combined chemical space. To address this gap, we investigate how well federated clustering methods can disentangle and represent distributed molecular data. We benchmark three approaches, Federated kMeans (Fed-kMeans), Federated Principal Component Analysis combined with Fed-kMeans (Fed-PCA+Fed-kMeans), and Federated…
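To make the Fed-kMeans idea named in the abstract more concrete, here is a minimal sketch of one common formulation of federated k-means. It is not the authors' implementation: the function names (`local_update`, `fed_kmeans`), the use of plain NumPy arrays as stand-ins for molecular feature vectors, and the choice to share per-cluster sums and counts are all assumptions made for illustration. Each client assigns its own points to the current global centroids and sends only aggregate statistics to the server; raw data never leaves the client.

```python
import numpy as np

def local_update(X, centroids):
    """Client step (hypothetical helper): assign local points to the nearest
    centroid and return per-cluster feature sums and point counts."""
    # Squared Euclidean distance from every local point to every centroid.
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d.argmin(axis=1)
    k, dim = centroids.shape
    sums = np.zeros((k, dim))
    counts = np.zeros(k, dtype=int)
    np.add.at(sums, labels, X)    # accumulate feature sums per cluster
    np.add.at(counts, labels, 1)  # count points per cluster
    return sums, counts

def fed_kmeans(client_data, init_centroids, n_rounds=10):
    """Server loop (assumed protocol): aggregate client sums and counts,
    then recompute the global centroids each round."""
    centroids = np.array(init_centroids, dtype=float)
    for _ in range(n_rounds):
        total_sums = np.zeros_like(centroids)
        total_counts = np.zeros(len(centroids), dtype=int)
        for X in client_data:
            s, c = local_update(X, centroids)
            total_sums += s
            total_counts += c
        # Update only clusters that received points; empty ones keep their centroid.
        nonempty = total_counts > 0
        centroids[nonempty] = total_sums[nonempty] / total_counts[nonempty, None]
    return centroids

# Toy usage: two clients, each holding a different region of the feature space.
client_a = np.array([[0.0, 0.0], [0.0, 2.0]])
client_b = np.array([[10.0, 10.0], [10.0, 12.0]])
centroids = fed_kmeans([client_a, client_b], [[0.0, 0.0], [10.0, 10.0]], n_rounds=3)
```

Because the server sums the client statistics before dividing, each round in this sketch produces the same centroids as a centralized k-means step on the pooled data with the same initialization. A real deployment would likely add secure aggregation or other privacy safeguards, which this sketch leaves out.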
