The Vendiscope: An Algorithmic Microscope For Data Collections
Amey P. Pasarkar, Adji Bousso Dieng

TL;DR
The Vendiscope is a novel algorithmic microscope that uses diversity metrics to analyze large data collections, revealing redundancies, model limitations, and memorization patterns across various scientific domains.
Contribution
It introduces the Vendiscope, a new computational tool that extends microscopy to data analysis using differentiable diversity metrics for high-resolution insights.
Findings
Identified over 200 million near-duplicate protein sequences in the protein universe.
Discovered that AlphaFold struggles with proteins contributing most to diversity.
Found that most crystals in the Materials Project are near-duplicates, affecting ML performance.
Abstract
The evolution of microscopy, beginning with its invention in the late 16th century, has continuously enhanced our ability to explore and understand the microscopic world, enabling increasingly detailed observations of structures and phenomena. In parallel, the rise of data-driven science has underscored the need for sophisticated methods to explore and understand the composition of complex data collections. This paper introduces the Vendiscope, the first algorithmic microscope designed to extend traditional microscopy to computational analysis. The Vendiscope leverages the Vendi scores -- a family of differentiable diversity metrics rooted in ecology and quantum mechanics -- and assigns weights to data points based on their contribution to the overall diversity of the collection. These weights enable high-resolution data analysis at scale. We demonstrate this across biology, materials…
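To make the abstract's description concrete, here is a minimal sketch of a Vendi score of order q computed from a similarity matrix, with an optional probability vector over data points. The function name `vendi_score`, the `weights` argument, and the toy matrix are illustrative assumptions, not the authors' code. The weights below are picked by hand, whereas the abstract describes weights derived from each point's contribution to diversity.

```python
import numpy as np

def vendi_score(K, q=1.0, weights=None):
    """Vendi score of order q for a PSD similarity matrix K with K[i, i] = 1.

    `weights` is a hypothetical argument: a probability vector over items,
    uniform by default.
    """
    n = K.shape[0]
    p = np.full(n, 1.0 / n) if weights is None else np.asarray(weights, dtype=float)
    s = np.sqrt(p)
    # Eigenvalues of diag(sqrt(p)) K diag(sqrt(p)); they sum to 1 when diag(K) = 1.
    lam = np.linalg.eigvalsh(s[:, None] * K * s[None, :])
    lam = lam[lam > 1e-12]  # drop numerically zero eigenvalues
    if np.isclose(q, 1.0):
        # Order 1: exponential of the Shannon entropy of the eigenvalues.
        return float(np.exp(-np.sum(lam * np.log(lam))))
    # Order q != 1: exponential of the Renyi entropy of order q.
    return float(np.exp(np.log(np.sum(lam ** q)) / (1.0 - q)))

# Toy collection: items 0 and 1 are exact duplicates, item 2 is unrelated.
K_dup = np.array([[1.0, 1.0, 0.0],
                  [1.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])

uniform = vendi_score(K_dup)                                  # below 2
reweighted = vendi_score(K_dup, weights=[0.25, 0.25, 0.5])    # effectively 2 items
```

In this toy case, splitting one item's share of the weight between the two duplicates brings the score to the number of distinct items. That is one way to see why low per-point weights can flag redundancy.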
Peer Reviews
Decision·Submitted to ICLR 2026
(1) This paper introduces a unified, scalable framework to quantify each datapoint’s contribution to dataset diversity across multiple domains. (2) It effectively bridges data analysis and model diagnosis by revealing redundancy, rarity, and memorization patterns using a single interpretable metric.
See questions
+ The paper is overall easy to understand.
+ The paper includes a large number of experiments.
- The main weakness lies in the contribution of the paper. Using Vendi scoring to quantify the diversity of datasets does not seem like a significant contribution, and the paper does not lead to any new finding using this scheme. It is well known that AlphaFold struggles to predict structures for proteins with few homologs. The memorization behavior of generative models is also well known.
- It is not clear why CIFAR-10 is used for images; it is a small image dataset with limited diversity. The images
Demonstrates applicability across three different domains (one in the appendix). Findings are relevant to each domain, e.g., strong clustering results on protein sequences relative to MMseqs2, and proper evaluation metrics in the presence of duplicates for image generative modeling.
Would be interested to see (possibly deferred to the appendix) slightly more discussion of any potential issues in generating the Vendi scores. For instance, how was it decided whether to use q=0.1 or 0.5, and how sensitive were the results to the value of q? How sensitive are the results to the particular data embedding chosen?
Taxonomy
Topics: Anatomy and Medical Technology
Methods: Ontology · AlphaFold
