Unsupervised Co-Learning on $\mathcal{G}$-Manifolds Across Irreducible Representations
Yifeng Fan, Tingran Gao, Zhizhen Zhao

TL;DR
This paper presents a new unsupervised co-learning approach for manifolds with group actions, leveraging multiple irreducible representations to improve tasks like nearest neighbor search and community detection.
Contribution
It introduces a novel representation theoretic framework for manifold co-learning across irreducible group representations, enhancing unsupervised learning on structured manifolds.
Findings
Improved robustness in nearest neighbor search.
Enhanced community detection in cryo-electron microscopy images.
Effective use of multiple irreducible representations for manifold learning.
Abstract
We introduce a novel co-learning paradigm for manifolds naturally equipped with a group action, motivated by recent developments on learning a manifold from attached fibre bundle structures. We utilize a representation theoretic mechanism that canonically associates multiple independent vector bundles over a common base manifold, which provides multiple views for the geometry of the underlying manifold. The consistency across these fibre bundles provides a common base for performing unsupervised manifold co-learning through the redundancy created artificially across irreducible representations of the transformation group. We demonstrate the efficacy of the proposed algorithmic paradigm through drastically improved robust nearest neighbor search and community detection on rotation-invariant cryo-electron microscopy image analysis.
| group | method | clusters | | | clusters | | |
|---|---|---|---|---|---|---|---|
| SO(2) | Scalar | 0.569 0.069 | 0.705 0.092 | 0.837 0.059 | 0.868 0.010 | 0.948 0.015 | 0.981 0.013 |
| | VDM | 0.526 0.036 | 0.644 0.076 | 0.857 0.057 | 0.892 0.010 | 0.963 0.011 | 0.994 0.008 |
| | Power spec. (ours) | 0.670 0.065 | 0.899 0.051 | 0.981 0.021 | 0.975 0.010 | 0.991 0.011 | 0.998 0.006 |
| | Opt (ours) | 0.687 0.011 | 0.912 0.009 | 0.986 0.007 | 0.976 0.012 | 0.994 0.008 | 0.997 0.005 |
| | Bispec. (ours) | 0.664 0.073 | 0.901 0.062 | 0.983 0.019 | 0.967 0.014 | 0.997 0.003 | 1 0 |
| SO(3) | Scalar | 0.572 0.061 | 0.666 0.095 | 0.862 0.056 | 0.838 0.003 | 0.838 0.007 | 0.909 0.019 |
| | VDM | 0.600 0.048 | 0.840 0.056 | 0.974 0.023 | 0.850 0.011 | 0.919 0.013 | 0.965 0.014 |
| | Power spec. (ours) | 0.921 0.038 | 0.986 0.016 | 1 0 | 0.874 0.011 | 0.939 0.011 | 0.981 0.017 |
| | Bispec. (ours) | 0.911 0.043 | 0.990 0.010 | 1 0 | 0.869 0.012 | 0.943 0.009 | 0.979 0.011 |
| method / truncation cutoff | 2 | 5 | 10 | 20 | 50 | 100 | |
|---|---|---|---|---|---|---|---|
| VDM | 2.63 | 3.02 | 3.48 | 3.67 | 4.14 | 4.59 | |
| Power spec. (ours) | 2.91 | 4.71 | 5.93 | 7.05 | 9.16 | 12.30 | |
| Opt (ours) | 2.94 | 5.66 | 7.26 | 8.95 | 12.20 | 16.43 | |
| Bispec. (ours) | 2.88 | 5.53 | 7.24 | 8.70 | 11.91 | 16.23 | |
| VDM | 2.82 | 4.60 | 8.05 | 9.46 | 9.25 | 9.13 | |
| Power spec. (ours) | 4.37 | 14.96 | 33.44 | 38.85 | 38.21 | 37.77 | |
| Opt (ours) | 5.64 | 22.65 | 45.70 | 51.59 | 50.93 | 49.77 | |
| Bispec. (ours) | 5.62 | 22.17 | 44.84 | 50.44 | 49.57 | 48.68 | |
| VDM | 3.48 | 8.65 | 17.96 | 27.56 | 24.58 | 20.87 | |
| Power spec. (ours) | 8.29 | 38.22 | 68.09 | 83.04 | 78.92 | 73.56 | |
| Opt (ours) | 15.03 | 52.25 | 77.38 | 87.72 | 86.25 | 82.66 | |
| Bispec. (ours) | 14.95 | 51.19 | 76.57 | 87.33 | 85.56 | 81.90 | |
| VDM | 57.04 | 98.48 | 99.99 | 100 | 100 | 100 | |
| Power spec. (ours) | 99.05 | 100 | 100 | 100 | 100 | 100 | |
| Opt (ours) | 99.60 | 99.99 | 100 | 100 | 100 | 100 | |
| Bispec. (ours) | 99.60 | 100 | 100 | 100 | 100 | 100 | |
| method / maximum frequency | 2 | 5 | 10 | 20 | 50 | |
|---|---|---|---|---|---|---|
| VDM | — 3.67 — | |||||
| Power spec. (ours) | 4.12 | 5.23 | 7.05 | 9.52 | 11.45 | |
| Opt (ours) | 4.06 | 5.39 | 8.95 | 16.59 | 29.17 | |
| Bispec. (ours) | 4.03 | 5.26 | 8.70 | 15.73 | 26.55 | |
| VDM | — 9.46 — | |||||
| Power spec. (ours) | 13.47 | 25.49 | 38.85 | 52.02 | 55.40 | |
| Opt (ours) | 13.28 | 29.74 | 51.59 | 71.21 | 76.30 | |
| Bispec. (ours) | 13.03 | 28.85 | 50.44 | 70.18 | 77.26 | |
| VDM | — 27.56 — | |||||
| Power spec. (ours) | 43.66 | 69.29 | 83.04 | 90.15 | 90.56 | |
| Opt (ours) | 43.42 | 73.69 | 87.72 | 93.01 | 92.07 | |
| Bispec. (ours) | 42.15 | 72.55 | 87.33 | 93.05 | 93.20 | |
| VDM | — 100 — | |||||
| Power spec. (ours) | 100 | 100 | 100 | 100 | 100 | |
| Opt (ours) | 100 | 100 | 100 | 100 | 100 | |
| Bispec. (ours) | 100 | 100 | 100 | 100 | 100 | |
| method / truncation | 2 | 5 | 10 | 20 | 50 | 100 | |
|---|---|---|---|---|---|---|---|
| Scalar | 0.828 0.032 | 0.847 0.020 | 0.865 0.017 | 0.853 0.014 | 0.834 0.010 | 0.823 0.012 | |
| VDM | 0.825 0.024 | 0.854 0.021 | 0.879 0.020 | 0.916 0.015 | 0.925 0.016 | 0.912 0.019 | |
| Power spec. (ours) | 0.849 0.022 | 0.938 0.018 | 0.979 0.008 | 0.961 0.010 | 0.955 0.012 | 0.973 0.007 | |
| Opt (ours) | 0.878 0.025 | 0.957 0.016 | 0.966 0.010 | 0.983 0.007 | 0.960 0.009 | 0.975 0.008 | |
| Bispec. (ours) | 0.869 0.019 | 0.948 0.013 | 0.975 0.009 | 0.955 0.014 | 0.957 0.008 | 0.927 0.016 | |
| Scalar | 0.838 0.032 | 0.881 0.024 | 0.958 0.017 | 0.941 0.010 | 0.845 0.028 | 0.830 0.031 | |
| VDM | 0.823 0.027 | 0.903 0.015 | 0.959 0.011 | 0.958 0.011 | 0.962 0.008 | 0.982 0.005 | |
| Power spec. (ours) | 0.894 0.019 | 0.985 0.007 | 0.997 0.002 | 0.996 0.002 | 0.996 0.002 | 0.995 0.003 | |
| Opt (ours) | 0.905 0.020 | 0.993 0.003 | 0.998 0.001 | 0.996 0.001 | 0.997 0.001 | 0.974 0.008 | |
| Bispec. (ours) | 0.895 0.021 | 0.986 0.07 | 0.997 0.002 | 0.996 0.002 | 0.964 0.017 | 0.917 0.024 | |
| Scalar | 0.850 0.016 | 0.913 0.018 | 0.985 0.008 | 0.986 0.009 | 0.864 0.032 | 0.830 0.022 | |
| VDM | 0.854 0.012 | 0.950 0.011 | 0.992 0.008 | 0.993 0.005 | 0.993 0.004 | 0.993 0.005 | |
| Power spec. (ours) | 0.948 0.021 | 0.998 0.001 | 1 0 | 1 0 | 1 0 | 1 0 | |
| Opt (ours) | 0.982 0.008 | 0.999 0.001 | 1 0 | 1 0 | 1 0 | 1 0 | |
| Bispec. (ours) | 0.952 0.013 | 0.998 0.001 | 1 0 | 1 0 | 1 0 | 1 0 | |
| method / maximum frequency | 2 | 5 | 10 | 20 | 50 | 100 | |
|---|---|---|---|---|---|---|---|
| Scalar | — 0.865 0 — | ||||||
| VDM | — 0.879 0 — | ||||||
| Power spec. (ours) | 0.920 0.019 | 0.958 0.009 | 0.979 0.004 | 0.981 0.004 | 0.965 0.007 | 0.985 0.003 | |
| Opt (ours) | 0.920 0.014 | 0.951 0.009 | 0.957 0.008 | 0.988 0.003 | 0.968 0.005 | 0.993 0.002 | |
| Bispec. (ours) | 0.898 0.025 | 0.960 0.010 | 0.975 0.008 | 0.976 0.007 | 0.989 0.005 | 0.990 0.004 | |
| Scalar | — 0.958 0 — | ||||||
| VDM | — 0.959 0 — | ||||||
| Power spec. (ours) | 0.991 0.003 | 0.974 0.008 | 0.997 0.001 | 0.997 0.001 | 0.999 0.001 | 1 0 | |
| Opt (ours) | 0.970 0.012 | 0.996 0.002 | 0.998 0.001 | 0.998 0.001 | 0.999 0.001 | 0.999 0.001 | |
| Bispec. (ours) | 0.989 0.005 | 0.996 0.002 | 0.997 0.001 | 0.998 0.001 | 1 0 | 1 0 | |
| Scalar | — 0.985 0 — | ||||||
| VDM | — 0.992 0 — | ||||||
| Power spec. (ours) | 0.997 0.001 | 1 0 | 1 0 | 1 0 | 1 0 | 1 0 | |
| Opt (ours) | 0.998 0.001 | 1 0 | 1 0 | 1 0 | 1 0 | 1 0 | |
| Bispec. (ours) | 0.996 0.002 | 1 0 | 1 0 | 1 0 | 1 0 | 1 0 | |
Unsupervised Co-Learning on $\mathcal{G}$-Manifolds Across Irreducible Representations
Yifeng Fan1 Tingran Gao2 Zhizhen Zhao1
1University of Illinois at Urbana-Champaign 2University of Chicago
{yifengf2, zhizhenz}@illinois.edu [email protected]
Abstract
We introduce a novel co-learning paradigm for manifolds naturally admitting an action of a transformation group $\mathcal{G}$, motivated by recent developments on learning a manifold from attached fibre bundle structures. We utilize a representation theoretic mechanism that canonically associates multiple independent vector bundles over a common base manifold, which provides multiple views for the geometry of the underlying manifold. The consistency across these fibre bundles provides a common base for performing unsupervised manifold co-learning through the redundancy created artificially across irreducible representations of the transformation group. We demonstrate the efficacy of our proposed algorithmic paradigm through drastically improved robust nearest neighbor identification in cryo-electron microscopy image analysis and the clustering accuracy in community detection.
1 Introduction
Fighting with the curse of dimensionality by leveraging low-dimensional intrinsic structures has become an important guiding principle in modern data science. Apart from classical structural assumptions commonly employed in sparsity or low-rank models in high dimensional statistics [63, 11, 12, 49, 2, 64, 67], recently it has become of interest to leverage more intricate properties of the underlying geometric model, motivated by algebraic or differential geometry techniques, for efficient learning and inference from massive complex datasets [15, 16, 44, 46, 8]. The assumption that high dimensional datasets lie approximately on a low-dimensional manifold, known as the manifold hypothesis, has been the cornerstone for the development of manifold learning [62, 52, 18, 3, 4, 5, 17, 57, 66] in the past few decades.
In many real applications, the low-dimensional manifold underlying the dataset of high ambient dimensionality admits additional structures that can be fully leveraged to gain deeper insights into the geometry of the data. One class of such examples arises in scientific fields such as cryo-electron microscopy (cryo-EM), where large numbers of random projections for a three-dimensional molecule generate massive collections of images that can be determined only up to in-plane rotations [59, 72]. Another source of examples is the application in computer vision and robotics, where a major challenge is to recognize and compare three-dimensional spatial configurations up to the action of Euclidean or conformal groups [28, 10]. In these examples, the dataset of interest consists of images or shapes of potentially high spatial resolution, and admits a natural group action that plays the role of a nuisance or latent variable that needs to be “quotiented out” before useful information is revealed.
In geometric terms, on top of a differentiable manifold $\mathcal{M}$ underlying the dataset of interest, as assumed in the manifold hypothesis, we also assume the manifold admits a smooth right action of a Lie group $\mathcal{G}$, in the sense that there is a smooth map $\phi:\mathcal{G}\times\mathcal{M}\rightarrow\mathcal{M}$ satisfying $\phi(e,m)=m$ and $\phi(g,\phi(h,m))=\phi(hg,m)$ for all $m\in\mathcal{M}$ and $g,h\in\mathcal{G}$, where $e$ is the identity element of $\mathcal{G}$. A left action can be defined similarly. Such a group action reflects abundant information about the symmetry of the underlying manifold, with which one can study geometric and topological properties of the underlying manifold through the lens of the orbits, stabilizers, or induced finite- or infinite-dimensional representations of $\mathcal{G}$. In modern differential and symplectic geometry literature, a smooth manifold $\mathcal{M}$ admitting the action of a Lie group $\mathcal{G}$ is often referred to as a $\mathcal{G}$-manifold (see e.g. [40, §6], [50, 1, 33] and references therein), and this transformation-centered methodology has proven fruitful [42, 53, 40, 30] for several generations of geometers and topologists.
Recent development of manifold learning has started to digest and incorporate the additional information encoded in the $\mathcal{G}$-actions on the low-dimensional manifold underlying the high-dimensional data. In [36], the authors constructed a steerable graph Laplacian on the manifold of images, modeled as a rotationally invariant manifold (or $\mathrm{SO}(2)$-manifold in geometric terms), that serves the role of graph Laplacian in manifold learning but with naturally built-in rotational invariance by construction. In [38], the authors proposed a principal bundle model for image denoising, which achieved state-of-the-art performance by combining patch-based image analysis with rotationally invariant distances in microscopy [47]. A major contribution of this paper is to provide deeper insights into the success of these group-transformation-based manifold learning techniques from the perspective of multi-view learning [56, 60, 37] or co-training [7], and to propose a family of new methods that systematically utilize this additional information by exploiting the inherent consistency across representation theoretic patterns. Motivated by the recent line of research bridging manifold learning with principal and associated fibre bundles [57, 58, 22, 20, 19], we point out that a $\mathcal{G}$-manifold admitting a principal bundle structure is naturally associated with as many vector bundles as the number of distinct irreducible representations of the transformation group $\mathcal{G}$, and each of these vector bundles provides a separate “view” towards unveiling the geometry of the common base manifold on which all the fibre bundles reside.
Specifically, the main contributions of this paper are summarized as follows: (1) We propose a new unsupervised co-learning paradigm on $\mathcal{G}$-manifolds and an optimal alignment affinity measure for high-dimensional data points that lie on or close to a lower dimensional $\mathcal{G}$-manifold, using both the local cycle consistency of group transformations on the manifold (graph) and the algebraic consistency of the unitary irreducible representations of the transformations; (2) We introduce the invariant moments affinity in order to bypass the computationally intensive pairwise optimal alignment search and efficiently learn the underlying local neighborhood structure; and (3) We empirically demonstrate that our new framework is extremely robust to noise and apply it to improve cryo-EM image analysis and the clustering accuracy in community detection. Code is available at https://github.com/frankfyf/G-manifold-learning.
2 Related Work
**Manifold Learning:** After the ground-breaking works of [62, 52], [5, 56, 41] provided reproducing kernel Hilbert space frameworks for scalar and vector valued kernels and interpreted the manifold assumption as a specific type of regularization; [3, 4, 14] used the estimated eigenfunctions of the Laplace–Beltrami operator to parametrize the underlying manifold; [24, 25, 59] investigated the representation theoretic pattern of an integral operator acting on certain complex line bundles over the unit two-sphere naturally arising from cryo-EM image analysis; [57, 58, 22] demonstrated the benefit of using differential operators defined on fibre bundles over the manifold, instead of the Laplace–Beltrami operator on the manifold itself, in manifold learning tasks. Recently, [20, 19, 23] proposed to utilize the consistency across multiple irreducible representations of a compact Lie group to improve spectral decomposition based algorithms.
**Co-training and Multi-view Learning:** In their seminal work [7], Blum and Mitchell demonstrated both in theory and in practice that distinct “views” of a dataset can be combined together to improve the performance of learning tasks, through their complementary yet consistent prediction for unlabelled data. Similar ideas exploiting the consistency of the information contained in different sets of features have long existed in the statistical literature, such as canonical correlation analysis [29]. Since then, multi-view learning has remained a powerful idea percolating through aspects of machine learning ranging from supervised and semi-supervised learning to active learning and transfer learning [21, 43, 61, 13, 55, 56, 34, 35]. See surveys [60, 69, 70, 37] for more detailed accounts.
3 Geometric Motivation
We first provide a brief overview of the key concepts used in this paper from elementary group representation theory. Interested readers are referred to [54, 9] for more details.
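**Groups and Representation:** A group $\mathcal{G}$ is a set with an operation $\mathcal{G}\times\mathcal{G}\rightarrow\mathcal{G}$ obeying the following axioms: (1) $\forall g_1,g_2\in\mathcal{G}$, $g_1g_2\in\mathcal{G}$; (2) $\forall g_1,g_2,g_3\in\mathcal{G}$, $g_1(g_2g_3)=(g_1g_2)g_3$; (3) There is a unique $e\in\mathcal{G}$ called the identity of $\mathcal{G}$, such that $eg=ge=g$ for all $g\in\mathcal{G}$; (4) $\forall g\in\mathcal{G}$, there is a corresponding element $g^{-1}\in\mathcal{G}$ called the inverse of $g$, such that $gg^{-1}=g^{-1}g=e$. A $d_{\rho}$-dimensional representation of a group $\mathcal{G}$ over a field $\mathbb{F}$ is a matrix valued function $\rho:\mathcal{G}\rightarrow\mathbb{F}^{d_{\rho}\times d_{\rho}}$ such that $\rho(g_1)\rho(g_2)=\rho(g_1g_2)$ for all $g_1,g_2\in\mathcal{G}$. In this paper, we assume $\mathbb{F}=\mathbb{C}$. A representation $\rho$ is said to be unitary if $\rho(g^{-1})=\rho(g)^{*}$ for any $g\in\mathcal{G}$, and is said to be reducible if it can be decomposed into a direct sum of lower-dimensional representations as $\rho(g)=Q^{-1}\left(\rho_1(g)\oplus\rho_2(g)\right)Q$ for some invertible matrix $Q\in\mathbb{C}^{d_{\rho}\times d_{\rho}}$; otherwise $\rho$ is irreducible. The symbol $\oplus$ denotes the direct sum. For a compact group, there exists a complete set of inequivalent irreducible representations (in brevity: irreps) and any representation can be reduced into a direct sum of irreps.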
**Fourier Transform:** In many applications of interest, the Lie group is compact and thus always admits irreps, and the concept of irreps allows generalizing the Fourier transform to any compact group. By the renowned Peter–Weyl theorem, any square integrable function $f\in L^2(\mathcal{G})$ can be decomposed as

$$f(g)=\sum_{k=0}^{\infty} d_k\,\mathrm{Tr}\left[F_k\,\rho_k(g)\right],\qquad F_k=\int_{\mathcal{G}} f(g)\,\rho_k^{*}(g)\,d\mu_g, \tag{1}$$

where each $\rho_k$ is a unitary irrep of $\mathcal{G}$ with dimension $d_k$. This is the compact Lie group analogue of the standard Fourier series over the unit circle. The “generalized Fourier coefficient” $F_k$ in (1) is defined by the integral taken with respect to the Haar measure on $\mathcal{G}$.
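As a concrete instance (a standard fact, not specific to this paper), take $\mathcal{G}=\mathrm{SO}(2)$ parametrized by the rotation angle $\theta\in[0,2\pi)$. Every unitary irrep is then one-dimensional, $\rho_k(\theta)=e^{\imath k\theta}$ with $k\in\mathbb{Z}$ and $d_k=1$, and with the normalized Haar measure $d\theta/2\pi$ the Peter–Weyl decomposition reduces to the classical Fourier series:

```latex
f(\theta)=\sum_{k\in\mathbb{Z}} F_k\,e^{\imath k\theta},
\qquad
F_k=\frac{1}{2\pi}\int_{0}^{2\pi} f(\theta)\,e^{-\imath k\theta}\,d\theta .
```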
**Motivation:** Motivated by [38, 36], we consider the principal bundle structures on a $\mathcal{G}$-manifold $\mathcal{M}$. Below we state the definitions of fibre bundle and principal bundle for convenience; see [6] for more details. Briefly speaking, a fibre bundle is a manifold which is locally diffeomorphic to a product space, and a principal fibre bundle is a fibre bundle with a natural group action on its “fibres.”
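**Definition 1 (Fibre Bundle)**
Let $\mathcal{M},\mathcal{B},\mathcal{F}$ be three differentiable manifolds, and let $\pi:\mathcal{M}\rightarrow\mathcal{B}$ denote a smooth surjective map between $\mathcal{M}$ and $\mathcal{B}$. We say that $\mathcal{M}\xrightarrow{\pi}\mathcal{B}$ (or just $\mathcal{M}$ for short) is a fibre bundle with typical fibre $\mathcal{F}$ over $\mathcal{B}$ if $\mathcal{B}$ admits an open cover $\mathcal{U}$ such that $\pi^{-1}(U)$ is diffeomorphic to the product space $U\times\mathcal{F}$ for any open set $U\in\mathcal{U}$. For any $x\in\mathcal{B}$, we denote $\mathcal{F}_x:=\pi^{-1}(x)$ and call it the fibre over $x$.
**Definition 2 (Principal Bundle)**
Let $\mathcal{M}$ be a fibre bundle, and $\mathcal{G}$ a Lie group. We call $\mathcal{M}$ a principal $\mathcal{G}$-bundle if (1) $\mathcal{M}$ is a fibre bundle, (2) $\mathcal{M}$ admits a right action of $\mathcal{G}$ that preserves the fibres of $\mathcal{M}$, in the sense that for any $m\in\mathcal{M}$ and $g\in\mathcal{G}$ we have $\pi(m)=\pi(m\cdot g)$, and (3) for any two points $p,q\in\mathcal{M}$ on the same fibre of $\mathcal{M}$, there exists a group element $g\in\mathcal{G}$ satisfying $p\cdot g=q$.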
If $\mathcal{M}$ is a principal $\mathcal{G}$-bundle over $\mathcal{B}$, any representation $\rho$ of $\mathcal{G}$ on a vector space $V$ induces an associated vector bundle over $\mathcal{B}$ with typical fibre $V$, denoted as $\mathcal{M}\times_{\rho}V$, defined as a quotient space $\mathcal{M}\times_{\rho}V:=\mathcal{M}\times V\big/\sim$ where the equivalence relation is defined by $(m\cdot g,v)\sim(m,\rho(g)v)$ for all $m\in\mathcal{M}$, $g\in\mathcal{G}$, and $v\in V$. This construction gives rise to as many different associated vector bundles as the number of distinct representations of the Lie group $\mathcal{G}$. This allows us to study the $\mathcal{G}$-manifold $\mathcal{M}$, as a principal $\mathcal{G}$-bundle, through tools developed for learning an unknown manifold from attached vector bundle structures, such as vector diffusion maps (VDM) [57, 58]. We consider each of these associated vector bundles as a distinct “view” towards the unknown data manifold $\mathcal{B}$, as the representations inducing these vector bundles are different. In the rest of this paper, we will illustrate with several examples how to design learning and inference algorithms that exploit the inherent consistency in these associated vector bundles by representation theoretic machinery. Unlike the co-training setting where the consistency is induced from the labelled samples onto the unlabelled samples, in our unsupervised setting no labelled training data is provided and the consistency is induced purely from the geometry of the $\mathcal{G}$-manifold.
4 Methods
**Problem Setup:** Given a collection of $n$ data points $\{x_1,\ldots,x_n\}\subset\mathbb{R}^{l}$, we assume they lie on or close to a low dimensional smooth manifold $\mathcal{M}$ of intrinsic dimension $d\ll l$, and that $\mathcal{M}$ is a $\mathcal{G}$-manifold admitting the structure of a principal $\mathcal{G}$-bundle with a compact Lie group $\mathcal{G}$. The data space $\mathcal{M}$ is closed under the action of $\mathcal{G}$. That is, $g\cdot x\in\mathcal{M}$ for all group transformations $g\in\mathcal{G}$ and data points $x\in\mathcal{M}$, where ‘$\cdot$’ denotes the group action. As an example, in a cryo-EM image dataset each image is a projection of a macromolecule with a random orientation, therefore $\mathcal{M}=\mathrm{SO}(3)$, which is the 3-D rotation group, and $\mathcal{G}=\mathrm{SO}(2)$, which is the in-plane rotation of images. The $\mathcal{G}$-invariant distance $d_{ij}$ between two data points $x_i$ and $x_j$ is defined as

$$d_{ij}=\min_{g\in\mathcal{G}}\left\|x_i-g\cdot x_j\right\|,\qquad g_{ij}=\operatorname*{arg\,min}_{g\in\mathcal{G}}\left\|x_i-g\cdot x_j\right\|, \tag{2}$$

where $\|\cdot\|$ is the Euclidean distance on the ambient space $\mathbb{R}^{l}$ and $g_{ij}$ is the associated alignment, which is assumed to be unique. Then we build an undirected graph $G=(V,E)$ whose nodes are represented by data points. Edge connections are determined from $d_{ij}$ using the $\epsilon$-neighborhood criterion, i.e. $(i,j)\in E$ iff $d_{ij}<\epsilon$, or the $\kappa$-nearest neighbor criterion, i.e. $(i,j)\in E$ iff $j$ is one of the $\kappa$ nearest neighbors of $i$. The edge weights $w_{ij}$ are defined using a kernel function on $d_{ij}$ as $w_{ij}=K_{\sigma}(d_{ij})$. The resulting graph $G$ is defined on the quotient space $\mathcal{B}:=\mathcal{M}/\mathcal{G}$ and is invariant to the group transformations within data points, e.g. for the viewing angles of cryo-EM images $\mathcal{B}=\mathrm{SO}(3)/\mathrm{SO}(2)=S^2$. In a noiseless world, $G$ should be a neighborhood graph which only connects data points on $\mathcal{B}$ with small $d_{ij}$. However, in many applications, noise in the observational data severely degrades the estimates of the $\mathcal{G}$-invariant distances $d_{ij}$ and optimal alignments $g_{ij}$. This leads to errors in the edge connections of $G$, which then link distant data points on $\mathcal{B}$ whose underlying geodesic distances are large.
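The invariant distance and alignment in the setup above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the authors' implementation: the group is a discretized $\mathrm{SO}(2)$ acting on sampled periodic signals by cyclic shifts (for cryo-EM it would be in-plane image rotation), the search is brute force, and the names `invariant_distance` and `gaussian_weight`, as well as the Gaussian form of the kernel, are hypothetical choices.

```python
import numpy as np

def invariant_distance(x_i, x_j):
    """Return (d_ij, s_ij): the minimum of ||x_i - g . x_j|| over all cyclic
    shifts g, and the index of the minimizing shift."""
    dists = np.array([np.linalg.norm(x_i - np.roll(x_j, s))
                      for s in range(len(x_j))])
    s_best = int(np.argmin(dists))
    return float(dists[s_best]), s_best

def gaussian_weight(d, sigma=1.0):
    """Hypothetical kernel choice: w_ij = exp(-d_ij^2 / sigma)."""
    return float(np.exp(-d**2 / sigma))

rng = np.random.default_rng(0)
x = rng.standard_normal(32)
y = np.roll(x, -5)                 # the same signal, moved along its orbit
d, s = invariant_distance(x, y)    # shifting y by 5 samples recovers x
w = gaussian_weight(d)
```

Two points on the same orbit get distance zero and therefore the largest possible edge weight, which is what makes the resulting graph live on the quotient space rather than on the data manifold itself.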
Given the noisy graph, we consider the problem of removing the wrong connections and recovering the underlying clean graph structure on the quotient space $\mathcal{B}=\mathcal{M}/\mathcal{G}$, especially under a high level of noise. We propose a robust, unsupervised co-learning framework to address this problem. It has two steps: first, it builds a series of adjacency matrices with different irreps and filters the original noisy graph for denoising; second, it checks the affinity between node pairs to identify true neighbors in the clean graph. The main intuition is to systematically explore the consistency of the group transformations of the principal bundle across all irreps of $\mathcal{G}$, which results in a robust measure of the affinity (see Fig. 1).
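**Weight Matrices Using Irreps:** We start from building a series of weight matrices using multiple irreps of the compact Lie group $\mathcal{G}$. Given the graph $G=(V,E)$ with $n$ nodes and the group transformations $g_{ij}\in\mathcal{G}$, we assign a weight on each edge $(i,j)\in E$ by taking into account both the scalar edge connection weight $w_{ij}$ and the associated alignment $g_{ij}$ using unitary irreps $\rho_k$ for $k=1,\ldots,k_{\max}$. The resulting graph can be described by a set of weight matrices $W_k$:

$$W_k(i,j)=\begin{cases} w_{ij}\,\rho_k(g_{ij}) & (i,j)\in E,\\ 0 & \text{otherwise},\end{cases} \tag{3}$$

where $w_{ij}=w_{ji}$ and $g_{ji}=g_{ij}^{-1}$ for all $(i,j)\in E$. Recall the unitary irrep $\rho_k(g)$ is a $d_k\times d_k$ matrix, therefore $W_k$ is a block matrix with $n\times n$ blocks of size $d_k\times d_k$. In particular, the corresponding degree matrix $D_k$ is also a block diagonal matrix with the $(i,i)$-block as:

$$D_k(i,i)=\deg(i)\,I_{d_k\times d_k},\qquad \deg(i):=\sum_{j:(i,j)\in E} w_{ij}. \tag{4}$$

The Hilbert space $\mathcal{H}:=L^2(\mathcal{M})$, as a unitary representation of the compact Lie group $\mathcal{G}$, admits an isotypic decomposition $\mathcal{H}=\bigoplus_{k}\mathcal{H}_k$, where a function $f$ is in $\mathcal{H}_k$ if and only if $f(xg)=\rho_k(g)f(x)$. Then for each irrep $\rho_k$, we construct a normalized matrix $A_k=D_k^{-1}W_k$, which is an averaging operator for vector fields in $\mathcal{H}_k$. That is, for any vector $z_k\in\mathcal{H}_k$:

$$(A_kz_k)(i)=\frac{1}{\deg(i)}\sum_{j:(i,j)\in E} w_{ij}\,\rho_k(g_{ij})\,z_k(j). \tag{5}$$

Notice that $A_k$ is similar to a Hermitian matrix $\widetilde{A}_k$ as:

$$A_k=D_k^{-1/2}\,\widetilde{A}_k\,D_k^{1/2},\qquad \widetilde{A}_k:=D_k^{-1/2}\,W_k\,D_k^{-1/2}, \tag{6}$$

which has real eigenvalues $\lambda_1^{(k)},\ldots,\lambda_{nd_k}^{(k)}$ and orthonormal eigenvectors $u_1^{(k)},\ldots,u_{nd_k}^{(k)}$, and all the eigenvalues are within $[-1,1]$. For simplicity, we assume data points are uniformly distributed on $\mathcal{M}$. If not, the normalization proposed in [17] can be applied to $W_k$. Now suppose there is a random walk on $G$ with a transition matrix $A_0$ and the trivial representation $\rho_0(g)=1$, then $A_0^{2t}(i,j)$ is the transition probability from $i$ to $j$ with $2t$ steps. Due to the usage of $\rho_k(g_{ij})$, $A_k^{2t}(i,j)$ not only takes into account the connectivity between the nodes $i$ and $j$, but also checks the consistency of transformations along all length-$2t$ paths between $i$ and $j$. Generally, in other cases when $k\geq 1$, $A_k^{2t}(i,j)$ is a sub-block matrix which still encodes such consistencies. Intuitively, if $i,j$ are true neighbors on $\mathcal{B}$, their transformations should be in agreement and we expect $\|A_k^{2t}(i,j)\|_{\mathrm{HS}}^2$ or $\|\widetilde{A}_k^{2t}(i,j)\|_{\mathrm{HS}}^2$ to be large, where $\|\cdot\|_{\mathrm{HS}}$ is the Hilbert-Schmidt norm. Previously, vector diffusion maps (VDM) [57, 58] considered $k=1$ and defined the pairwise affinity as $\|\widetilde{A}_1^{2t}(i,j)\|_{\mathrm{HS}}^2$.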
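A minimal sketch of these weight matrices for $\mathcal{G}=\mathrm{SO}(2)$, where every irrep is the scalar $\rho_k(\theta)=e^{\imath k\theta}$ and each block is $1\times 1$. The noiseless toy graph, the per-node angles `alpha`, and the helper names are assumptions made for illustration only.

```python
import numpy as np

def weight_matrix(w, theta, k):
    """W_k(i, j) = w_ij * rho_k(g_ij), with rho_k(theta) = exp(1j * k * theta)."""
    return w * np.exp(1j * k * theta)

def normalized_hermitian(W_k, w):
    """A~_k = D^{-1/2} W_k D^{-1/2}, where D holds the scalar degrees deg(i)."""
    d_inv_sqrt = 1.0 / np.sqrt(w.sum(axis=1))
    return d_inv_sqrt[:, None] * W_k * d_inv_sqrt[None, :]

rng = np.random.default_rng(0)
n = 6
alpha = rng.uniform(0.0, 2.0 * np.pi, n)   # hypothetical in-plane angle of each node
theta = alpha[:, None] - alpha[None, :]    # noiseless alignments, theta_ji = -theta_ij
w = np.ones((n, n)) - np.eye(n)            # complete graph with unit edge weights

A_tilde = {k: normalized_hermitian(weight_matrix(w, theta, k), w) for k in (1, 2, 3)}
eigvals = {k: np.linalg.eigvalsh(A) for k, A in A_tilde.items()}
```

Because the alignments satisfy $g_{ji}=g_{ij}^{-1}$, each matrix is Hermitian, and in this fully consistent toy graph the top eigenvalue equals 1 for every $k$.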
**Weight Matrices Filtering:** For denoising the graph, we generalize the VDM framework by first computing the filtered and normalized weight matrix $\widetilde{W}_k=\eta_t(\widetilde{A}_k)$ for all irreps $\rho_k$, where $\eta_t(\cdot)$ denotes a spectral filter acting on the eigenvalues, for example $\eta_t(\lambda)=\lambda^{2t}$ as in VDM. Moreover, since the small eigenvalues of $\widetilde{A}_k$ are more sensitive to noise, a truncation is applied by only keeping the top $m_kd_k$ eigenvalues and eigenvectors. Specifically, we equally divide $u_l^{(k)}$ of length $nd_k$ into $n$ blocks and denote the $i$th block as $u_l^{(k)}(i)$. In this way, we define a $\mathcal{G}$-equivariant mapping as:
[TABLE]
It can be further normalized to ensure the diagonal blocks of $\widetilde{W}_k$ are identity matrices, i.e. $\widetilde{W}_k(i,i)=I_{d_k\times d_k}$ for all nodes $i$. The steps for weight matrices filtering are detailed in Alg. 1. The resulting denoised $\widetilde{W}_k$ is then used for building our affinity measures.
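The filtering step can be sketched as follows for scalar ($d_k=1$) blocks. This is an illustration under assumptions rather than Alg. 1 itself: "top" is taken to mean the largest eigenvalues, the filter is $\eta(\lambda)=\lambda^{2t}$, and `filter_weight_matrix` is a hypothetical name.

```python
import numpy as np

def filter_weight_matrix(A_tilde, m, t=1):
    """Keep the m largest eigenpairs of the Hermitian A~_k, apply
    eta(lambda) = lambda^(2t), and rescale so that every diagonal entry is 1."""
    lam, U = np.linalg.eigh(A_tilde)         # eigenvalues in ascending order
    lam, U = lam[-m:], U[:, -m:]             # truncation: top-m eigenpairs
    psi = U * lam**t                         # row i is the embedding psi(i)
    W_f = psi @ psi.conj().T                 # equals U diag(lam^(2t)) U^*
    scale = 1.0 / np.sqrt(np.real(np.diag(W_f)))
    return scale[:, None] * W_f * scale[None, :]

rng = np.random.default_rng(1)
n, k = 8, 2
alpha = rng.uniform(0.0, 2.0 * np.pi, n)     # hypothetical node angles
theta = alpha[:, None] - alpha[None, :]
w = np.ones((n, n)) - np.eye(n)
A_tilde = w * np.exp(1j * k * theta) / (n - 1)   # all degrees equal n - 1 here
W_tilde = filter_weight_matrix(A_tilde, m=1)
```

On this noiseless graph a single eigenpair already reproduces the alignments exactly, so `W_tilde[i, j]` equals $e^{\imath k\theta_{ij}}$; with noisy alignments the truncation acts as a low-rank denoiser.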
**Optimal Alignment Affinity:** At each irrep $\rho_k$, the filtered $\widetilde{W}_k$ involves the transformation consistency of the graph represented by $\rho_k$ and has its own ability to measure the affinity. Then, similar to the unsupervised multi-view learning approach, it is advantageous to boost this by coupling the information under different irreps to achieve a more accurate measurement (see Fig. 1). Furthermore, notice that if $i$ and $j$ are true neighbors, for each irrep $\rho_k$ the block $\widetilde{W}_k(i,j)$ should encode the same amount of associated alignment $g_{ij}$. Therefore, by applying the algebraic relation among $\widetilde{W}_k(i,j)$ across all irreps, we define the optimal alignment affinity according to the generalized Fourier transform in (1) and the definition of the weight matrices in (3):
[TABLE]
which can be evaluated using generalized FFTs [39]. Here both the cycle consistency within each graph and the algebraic relation across different irreps in Fig. 1 are considered.
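A minimal sketch of the alignment search for $\mathcal{G}=\mathrm{SO}(2)$, where the generalized FFT is the ordinary FFT over a grid of angles. The affinity is assumed here to take the form $\max_{\theta}\frac{1}{k_{\max}}\big|\sum_{k}\widetilde{W}_k(i,j)\,e^{-\imath k\theta}\big|$; that normalization, the grid size, and the function name are assumptions, not the paper's stated definition.

```python
import numpy as np

def optimal_alignment_affinity(c, n_angles=64):
    """c[k-1] is the filtered entry W~_k(i, j) for k = 1..k_max.
    Returns (affinity, index of the best grid angle)."""
    k_max = len(c)
    a = np.concatenate(([0.0], c))           # no k = 0 term in the sum
    s = np.fft.fft(a, n=n_angles)            # s[m] = sum_k c_k exp(-1j*k*2*pi*m/n_angles)
    best = int(np.argmax(np.abs(s)))
    return float(np.abs(s[best]) / k_max), best

ks = np.arange(1, 7)
theta_ij = 2.0 * np.pi * 7 / 64              # true alignment, chosen on the grid
consistent = np.exp(1j * ks * theta_ij)      # every irrep encodes the same angle
score, best = optimal_alignment_affinity(consistent)
```

When all irreps agree on one angle the terms add coherently, so the score reaches 1 at the grid index of the true alignment; inconsistent blocks cannot align at any single angle and score lower.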
**Power Spectrum Affinity:** Searching for the optimal alignment among all transformations as above could be computationally challenging and extremely time consuming. Therefore, invariant features can be used to speed up the computation. First we consider the power spectrum, which is the Fourier transform of the auto-correlation, defined as $P_f(k)=F_kF_k^{*}$ according to the convolution theorem. It is transformation invariant since, under the right action of $g\in\mathcal{G}$, the Fourier coefficients transform as $F_k\rightarrow F_k\rho_k(g)$ and $P_f(k)\rightarrow F_k\rho_k(g)\rho_k(g)^{*}F_k^{*}=P_f(k)$. Hence, for each $k$ we compute the power spectrum of $\widetilde{W}_k(i,j)$, namely $P_k(i,j)=\widetilde{W}_k(i,j)\widetilde{W}_k(i,j)^{*}$, and combine them as the power spectrum affinity:
[TABLE]
which does not require the search of optimal alignment and is thus computationally efficient. Recently, multi-frequency vector diffusion maps (MFVDM) [20] considered $\mathcal{G}=\mathrm{SO}(2)$ and summed the power spectrum at different irreps as their affinity. Here, we extend it to a general compact Lie group.
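A minimal sketch for $\mathcal{G}=\mathrm{SO}(2)$, where each block is a complex scalar and its power spectrum is the squared modulus. Averaging over $k$ is an assumed normalization, and `power_spectrum_affinity` is a hypothetical name.

```python
import numpy as np

def power_spectrum_affinity(c):
    """c[k-1] is the filtered entry W~_k(i, j); average |W~_k(i, j)|^2 over k."""
    return float(np.mean(np.abs(c) ** 2))

ks = np.arange(1, 7)
c = 0.8 * np.exp(1j * ks * 0.3)              # blocks encoding the alignment 0.3 rad
c_rotated = c * np.exp(1j * ks * 1.1)        # same pair after an in-plane rotation
same = np.isclose(power_spectrum_affinity(c), power_spectrum_affinity(c_rotated))
```

The rotation multiplies each block by a unit-modulus phase, so the affinity is unchanged and no search over alignments is needed.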
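**Bispectrum Affinity:** Although the power spectrum affinity combines the information at different irreps, it does not couple them and loses the relative phase information, i.e. the transformation across different irreps (see Fig. 1). Consequently, the affinity might be inaccurate under a high level of noise. In order to systematically impose the algebraic consistency without solving the optimization problem in (8), we consider another invariant feature called the bispectrum, which is the Fourier transform of the triple correlation and has been used in several fields [32, 27, 72, 31]. Formally, let us consider two unitary irreps $\rho_{k_1}$ and $\rho_{k_2}$ on finite dimensional vector spaces $\mathcal{H}_{k_1}$ and $\mathcal{H}_{k_2}$ of the compact Lie group $\mathcal{G}$. There is a unique decomposition of $\rho_{k_1}\otimes\rho_{k_2}$ into a set of unitary irreps $\rho_k$, $k\in k_1\otimes k_2$, where $\otimes$ is the Kronecker product of matrices, and we use $\oplus$ to denote direct sum. There exist $\mathcal{G}$-equivariant maps from $\mathcal{H}_{k_1}\otimes\mathcal{H}_{k_2}$ to $\bigoplus_{k}\mathcal{H}_k$, called generalized Clebsch–Gordan coefficients $C_{k_1,k_2}$ for $\mathcal{G}$, which satisfy:

$$\rho_{k_1}(g)\otimes\rho_{k_2}(g)=C_{k_1,k_2}\left[\bigoplus_{k\in k_1\otimes k_2}\rho_k(g)\right]C_{k_1,k_2}^{*}. \tag{10}$$

Using (10) and the fact that $C_{k_1,k_2}$ and the $\rho_k$'s are unitary matrices, we have

$$\left[\rho_{k_1}(g)\otimes\rho_{k_2}(g)\right]C_{k_1,k_2}\left[\bigoplus_{k\in k_1\otimes k_2}\rho_k^{*}(g)\right]C_{k_1,k_2}^{*}=I. \tag{11}$$

Particularly, the triple correlation of a function $f$ on $\mathcal{G}$ can be defined as $a_{3,f}(g_1,g_2)=\int_{\mathcal{G}}f^{*}(g)f(gg_1)f(gg_2)\,d\mu_g$. Then the bispectrum is defined as the Fourier transform of $a_{3,f}$ as

$$B_f(k_1,k_2)=\left[F_{k_1}\otimes F_{k_2}\right]C_{k_1,k_2}\left[\bigoplus_{k\in k_1\otimes k_2}F_k^{*}\right]C_{k_1,k_2}^{*}. \tag{12}$$

Under the action of $g$, we have the following properties of the Fourier coefficients of $f$: (1) $F_k\rightarrow F_k\rho_k(g)$, and (2) $F_{k_1}\otimes F_{k_2}\rightarrow\left(F_{k_1}\otimes F_{k_2}\right)\left(\rho_{k_1}(g)\otimes\rho_{k_2}(g)\right)$. Therefore, $B_f(k_1,k_2)$ is $\mathcal{G}$-invariant according to (11) and (12). By combining the bispectrum at different $k_1$ and $k_2$, we establish the bispectrum affinity as,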
[TABLE]
[TABLE]
If the transformations are consistent across different ’s, the trace of in (14) should be large. Therefore, this affinity not only takes into account the consistency of the transformation at each irrep, but also explores the algebraic consistency across different irreps.
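To make the difference from the power spectrum concrete, here is a small sketch for SO(2), where the Clebsch–Gordan coefficients are 1 and the bispectrum reduces to the product of two Fourier coefficients with the conjugate of a third. The signal, the sizes `n` and `k_max`, and the perturbed frequency are all hypothetical; the example only checks rotation invariance and sensitivity to relative phase, and is not the affinity in (13).

```python
import numpy as np

def fourier(f):
    return np.fft.fft(f) / len(f)

def power_spectrum(f, k_max):
    F = fourier(f)
    return np.abs(F[1:k_max + 1]) ** 2

def bispectrum(f, k_max):
    """b(k1, k2) = F_k1 * F_k2 * conj(F_{k1+k2}) for 1 <= k1, k2 <= k_max."""
    F = fourier(f)
    k = np.arange(1, k_max + 1)
    return F[k][:, None] * F[k][None, :] * np.conj(F[k[:, None] + k[None, :]])

rng = np.random.default_rng(0)
n, k_max = 64, 10             # requires 2 * k_max < n
f = rng.standard_normal(n)
f_rot = np.roll(f, 17)        # a rotated copy of f

# g has the same Fourier magnitudes as f, but the phase at frequency 3
# is changed independently of the others, so g is not a rotation of f.
F2 = np.fft.fft(f)
F2[3] *= np.exp(1j)
F2[-3] = np.conj(F2[3])       # keep g real-valued
g = np.fft.ifft(F2).real

same_bisp_rot = np.allclose(bispectrum(f, k_max), bispectrum(f_rot, k_max))
same_power_g = np.allclose(power_spectrum(f, k_max), power_spectrum(g, k_max))
same_bisp_g = np.allclose(bispectrum(f, k_max), bispectrum(g, k_max))
print(same_bisp_rot, same_power_g, same_bisp_g)
```

The power spectrum cannot tell `f` and `g` apart, while the bispectrum can, which mirrors the loss of relative phase information discussed above.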
**Higher Order Invariant Moments:** The power spectrum and bispectrum are second-order and third-order cumulants, and it is possible to design affinities using higher order invariant features. For example, we can define the order- -invariant features as: , where is the extension of the Clebsch–Gordan coefficients. However, using higher order spectra dramatically increases the computational complexity. In practice, the bispectrum is sufficient to check the consistency of the group transformations between nodes and across all irreps.
**Computational Complexity:** Filtering the normalized weight matrix involves computing the top eigenvectors of the sparse Hermitian matrices , for , which can be efficiently evaluated using the block Lanczos method [51] at a cost of , where is the average number of non-zero elements in each row of . We compute the spectral decomposition for different ’s in parallel. Computing the power spectrum invariant affinity for all pairs takes flops. The computational complexity of evaluating the bispectrum invariant affinity is . For the optimal alignment affinity, the computational complexity depends on the cost of the optimal alignment search, and the total cost is . For certain group structures where FFTs have been developed, the optimal alignment affinity can be efficiently and accurately approximated. However, is still larger than the computation cost of invariants.
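The filtering step can be sketched as a truncated eigendecomposition. The example below is an illustration under several assumptions: it uses a small dense SO(2) weight matrix with entries of the form e^{ik(α_i − α_j)}, a dense `numpy.linalg.eigh` call in place of a sparse block Lanczos solver, and a plain rank-`m` truncation without the normalization used in the paper. The helper `top_eig_filter` and all sizes are hypothetical.

```python
import numpy as np

def top_eig_filter(W, m):
    """Rank-m approximation of a Hermitian matrix from its m largest eigenpairs."""
    vals, vecs = np.linalg.eigh(W)           # eigenvalues in ascending order
    lam, U = vals[-m:], vecs[:, -m:]
    return (U * lam) @ U.conj().T

rng = np.random.default_rng(1)
n, k = 40, 2                                  # nodes and frequency (assumptions)
alpha = rng.uniform(0, 2 * np.pi, n)
v = np.exp(1j * k * alpha)
W_clean = np.outer(v, v.conj())               # rank one, entries e^{ik(a_i - a_j)}

# Hermitian noise added to every entry (a stand-in for corrupted alignments).
N = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
W_noisy = W_clean + 0.25 * (N + N.conj().T)

W_filt = top_eig_filter(W_noisy, m=1)
err_before = np.linalg.norm(W_noisy - W_clean)
err_after = np.linalg.norm(W_filt - W_clean)
print(err_before, err_after)
```

In this toy setting the truncation removes most of the noise energy because the clean matrix is low rank; for large sparse graphs an iterative eigensolver would replace the dense call.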
**Examples with $SO(2)$ and $SO(3)$:** If the group transformation is 2-D in-plane rotation, i.e. $\mathcal{G} = SO(2)$, the unitary irreps will be , where is the rotation angle. The dimensions of the irreps are , and . The generalized Clebsch–Gordan coefficients are 1 for all pairs. If $\mathcal{G}$ is the 3-D rotation group, i.e. $\mathcal{G} = SO(3)$, the unitary irreps are the Wigner D-matrices for [68]. The dimensions of the irreps are , and . The Clebsch–Gordan coefficients for all pairs can be numerically precomputed [26]. These two classical examples arise frequently in practice and are investigated in our experiments.
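As a short worked check of the two cases, assume the standard parametrization in which the SO(2) irrep at frequency $k$ is $\rho_k(\alpha) = e^{\imath k \alpha}$ and the SO(3) irrep at index $k$ is the Wigner D-matrix of dimension $2k+1$ (the symbols $\rho_k$, $k_1$, $k_2$ are notation chosen here for illustration). For SO(2) the tensor product of two irreps is again a single irrep, so no change of basis is required; for SO(3) the tensor product splits into several irreps, and the dimensions on both sides agree.

```latex
% SO(2): the product of two irreps is the irrep at the summed frequency
\rho_{k_1}(\alpha) \otimes \rho_{k_2}(\alpha)
  = e^{\imath k_1 \alpha}\, e^{\imath k_2 \alpha}
  = e^{\imath (k_1 + k_2)\alpha}
  = \rho_{k_1 + k_2}(\alpha).

% SO(3): Clebsch--Gordan decomposition and its dimension count
\rho_{k_1} \otimes \rho_{k_2} \;\cong\; \bigoplus_{k = |k_1 - k_2|}^{k_1 + k_2} \rho_k,
\qquad
(2k_1 + 1)(2k_2 + 1) = \sum_{k = |k_1 - k_2|}^{k_1 + k_2} (2k + 1).
```

For example, with $k_1 = 1$ and $k_2 = 2$ the left side has dimension $3 \cdot 5 = 15$ and the right side has $3 + 5 + 7 = 15$.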
5 Experiments
We evaluate our paradigm through three examples: (1) nearest neighbor (NN) search on the 2-sphere with ; (2) nearest viewing angle search for cryo-EM images; (3) spectral clustering with $SO(2)$ or $SO(3)$ transformations. We compare with the baseline vector diffusion maps (VDM) [57]. Since the greatest advantage of our paradigm is its robustness to noise, we demonstrate it on datasets contaminated by extremely high levels of noise. The settings of the hyperparameters, e.g. and , are given in the captions; we point out that our algorithm is not sensitive to the choice of parameters. The experiments are conducted in MATLAB on a computer with an Intel i7 7th generation quad core CPU.
**NN Search for :** We simulate points uniformly distributed over $SO(3)$ according to the Haar measure. Each point can be represented by a $3 \times 3$ orthogonal matrix , whose determinant is equal to 1. Then the vector can be realized as a point on the unit 2-sphere (i.e. ). The first two columns and span the tangent plane of the sphere at . Given two points and , there exists a rotation angle that optimally aligns the tangent bundles to as in (2). Therefore, the manifold is a -manifold with . Then we build a clean neighborhood graph by connecting nodes with , and add noise following a random rewiring model [59]. With probability , we keep the existing edge . With probability , we remove it and link to another vertex drawn uniformly at random from the remaining vertices that are not already connected to . For the rewired edges, the alignments are uniformly distributed over according to the Haar measure. In this way, the probability controls the signal to noise ratio (SNR), where indicates the clean case, while is fully random. For each node, we identify its 50 NNs based on the three proposed affinities and the affinity in VDM. In Fig. 2 we plot the histogram of of identified NNs under different SNRs. When to (over 90% of the edges are corrupted), the bispectrum and optimal alignment affinities achieve similar results and outperform power spectrum and VDM. This indicates that our proposed affinities are able to recover the underlying clean graph, even at an extremely high noise level.
**Nearest Viewing Angle Search for Cryo-EM Images:** One important application of the NN search above is in cryo-EM image analysis. Given a series of projection images of a macromolecule with unknown random orientations and extremely low SNR (see Fig. 3), we aim to identify images with similar projection directions and perform local rotational alignment, so that the image SNR can be boosted by averaging the aligned images. Each projection image can therefore be viewed as a point lying on the 2-sphere (i.e. ), and the group transformation is the in-plane rotation of an image (i.e., ).
In our experiments, we simulate projection images from a 3D electron density map of the 70S ribosome. The orientations of all projections are uniformly distributed over , and the images are contaminated by additive white Gaussian noise (see Fig. 3 for noisy samples). As preprocessing, we build the initial graph by using fast steerable PCA (sPCA) [71] and rotationally invariant features [72] to initially identify images of similar views and the corresponding in-plane rotational alignments. As in the example above, we compute the affinities for NN identification. In Fig. 3, we display the histograms of of identified NNs under different SNRs. The results show that all proposed affinities outperform VDM. The power spectrum and bispectrum affinities achieve similar results and outperform the optimal alignment affinity. This differs from the previous example with the random rewiring model on . The reason is that the two examples have different noise models: the random rewiring model has independent noise on edges, whereas the cryo-EM example has independent noise on nodes and dependent noise on edges.
**Spectral Clustering with $SO(2)$ or $SO(3)$ Transformations:** We apply our framework to spectral clustering. In particular, we assume there exists a group transformation in addition to the scalar weight between members (nodes) in a network. Formally, given data points with equal sized clusters, for each point , we uniformly assign an in-plane rotational angle , or a 3-D rotation . Then the optimal alignment is , or . We build the clean graph by fully connecting nodes within each cluster. The noisy graph is then built following the random rewiring model with a rewiring probability . We perform clustering by using our proposed affinities as the input of spectral clustering, and compare with traditional spectral clustering [45, 65], which only takes into account the scalar edge connection, and VDM [57], which defines the affinity based on the transformation consistency at a single representation. In Tab. 1, we use the Rand index [48] to measure the performance (a larger value is better). Our three affinities achieve similar accuracy and outperform traditional spectral clustering (scalar) and VDM. The results reported in Tab. 1 are evaluated over 50 trials for and 10 trials for , respectively.
For a better understanding, we visualize the affinity matrices obtained by the different approaches in Fig. 4a at and . We observe that at high noise levels, such as or 0.2, the underlying 2-cluster structure is easier to identify visually through our proposed affinities. In particular, as the bispectrum affinity in (13) is the combination of the bispectrum coefficients , Fig. 4b shows the component at different . Visually, the 2-cluster structure appears in each component with some variations across different components. Combining this information results in a more robust classifier.
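The data generation described in this paragraph can be sketched as follows. This is an illustrative reconstruction, not the authors' MATLAB code: the helper names `sample_so3` and `rewire`, the number of points, the inner-product threshold used to define clean neighbors, and the keep probability are all assumptions. Haar sampling uses the QR decomposition of a Gaussian matrix with a sign correction.

```python
import numpy as np

def sample_so3(n, rng):
    """Draw n rotation matrices from the Haar measure on SO(3) via QR."""
    R = np.empty((n, 3, 3))
    for i in range(n):
        Q, T = np.linalg.qr(rng.standard_normal((3, 3)))
        Q = Q * np.sign(np.diag(T))       # column sign fix: Haar on O(3)
        if np.linalg.det(Q) < 0:
            Q[:, 0] = -Q[:, 0]            # flip one column: determinant +1
        R[i] = Q
    return R

def rewire(edges, n, p, rng):
    """Keep each clean edge with probability p; otherwise link i to a random
    vertex not connected to i and attach a uniformly random angle."""
    nbrs = [set() for _ in range(n)]
    for i, j in edges:
        nbrs[i].add(j)
        nbrs[j].add(i)
    out = []
    for i, j in edges:
        cand = [u for u in range(n) if u != i and u not in nbrs[i]]
        if rng.random() < p or not cand:
            out.append((i, j, None))      # None marks a kept clean alignment
            continue
        j_new = int(rng.choice(cand))
        nbrs[i].discard(j)
        nbrs[j].discard(i)
        nbrs[i].add(j_new)
        nbrs[j_new].add(i)
        out.append((i, j_new, float(rng.uniform(0, 2 * np.pi))))
    return out

rng = np.random.default_rng(0)
n = 200                                   # number of points (assumption)
R = sample_so3(n, rng)
v = R[:, :, 2]                            # third column: point on the 2-sphere
G = v @ v.T
thresh = 0.9                              # neighbor threshold (assumption)
edges = [(i, j) for i in range(n) for j in range(i + 1, n) if G[i, j] > thresh]
noisy = rewire(edges, n, p=0.2, rng=rng)
print(len(edges), sum(a is not None for _, _, a in noisy))
```

With `p = 0.2` roughly 80% of the edges end up rewired, so most alignments in `noisy` carry no information about the clean geometry.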
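For reference, the Rand index used as the evaluation metric can be computed as the fraction of point pairs on which two labelings agree (both put the pair in the same cluster, or both separate it). The function below is a minimal sketch with a hypothetical name and quadratic memory use, which is adequate for the small graphs considered here.

```python
import numpy as np

def rand_index(labels_a, labels_b):
    """Fraction of unordered pairs on which the two labelings agree."""
    a, b = np.asarray(labels_a), np.asarray(labels_b)
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    iu = np.triu_indices(len(a), k=1)     # each unordered pair once
    return float(np.mean(same_a[iu] == same_b[iu]))

truth = [0, 0, 1, 1]
print(rand_index(truth, [1, 1, 0, 0]))    # identical partition, relabeled
print(rand_index(truth, [0, 1, 0, 1]))    # a poor partition
```

The index is invariant to a renaming of the cluster labels, so it can be applied directly to the output of spectral clustering.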
6 Conclusion
In this paper, we propose a novel mathematical and computational framework for unsupervised co-learning on $\mathcal{G}$-manifolds across multiple unitary irreps for robust nearest neighbor search and spectral clustering. The algorithm has two stages. In the first stage, the graph adjacency matrices are individually denoised through spectral filtering, which uses the local cycle consistency of the group transformation. The second stage checks the algebraic consistency across different irreps, and we propose three different ways to combine the information across all irreps. Using invariant moments bypasses the pairwise alignment and is computationally more efficient than the affinity based on the optimal alignment search. Experimental results show the efficacy of the framework compared to state-of-the-art methods, which either do not take the transformation group into account or use only a single representation.
Acknowledgement:
This work is supported in part by the National Science Foundation DMS-185479 and DMS-1854831.
Appendix A Additional Results on Spectral Clustering
In the main paper, we visualize the affinity matrices of clusters with for spectral clustering, in the presence of edge noise. Here we provide another visualization of the affinity measures for the results in Table 1 of the main paper, with and . In Fig. 5, we show the affinity matrices using single frequency () VDM, power spectrum, and bispectrum. The cutoff parameter and maximum frequency are set as and . We observe patterns similar to the 2-cluster example (see Fig. 4a in the main paper). For noisy examples with and , the cluster structure is more easily identified through our proposed affinities than through the scalar edge weights used in traditional spectral clustering [45] and single frequency VDM [57]. This again demonstrates the efficacy of our approach in estimating the cluster structures in the presence of a high level of noise on edges.
Appendix B Performance under Different Choices of Parameters
In this section, we include more numerical results to show the performance of our methods under different parameter settings and provide theoretical justification under a probabilistic model.
B.1 Nearest Neighbor Identification on Base Manifold
First, we analyze the spectral properties of the matrix based on the random rewiring model [59]. Starting from the underlying true graph, we perturb the graph in the following way: with probability , we remove the clean edge and create a link between and some random vertex, drawn uniformly at random from the remaining vertices that are not connected to . If the link between and is a rewired random link, then the associated group element is distributed over according to the Haar measure. The corresponding matrix is a random matrix under this model. If the distribution of is the Haar measure on , we have for . Therefore, we get for , where is the matrix with all links and group elements inferred correctly (). Thus the matrix can be decomposed as,
[TABLE]
where is a Hermitian random matrix with random blocks. The upper triangular part of the matrix contains independent random blocks with finite moments (the elements of are all bounded). Thus we use to describe the signal-to-noise ratio of the observed graph. According to matrix perturbation theory, the top eigenvectors of approximate the top eigenvectors of as long as the 2-norm of is not too large.
We numerically test the sensitivity of our methods to the choice of parameters in application to the nearest neighbor identification on base manifold. The setup of the experiments is similar to Section 5 in the main paper. We simulate data points uniformly sampled from and build the clean neighborhood graph on . The random rewiring perturbation is then applied to the clean graph and the nearest neighbors are identified based on the proposed affinities, with two varying parameters: cutoff parameter and maximum frequency . We evaluate by computing the proportion (in percentage) of all identified nearest neighbor pairs ’s whose . The results are shown in Tab. 2a and Tab. 3a. We make the following observations.
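The zero-mean property behind this decomposition can be checked numerically in the SO(2) case. The sketch below assumes irreps of the form e^{ikα} and estimates their average over uniformly distributed angles by Monte Carlo; the sample size and the range of frequencies are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = rng.uniform(0, 2 * np.pi, 200_000)    # Haar-distributed angles on SO(2)

# Empirical mean of the frequency-k irrep e^{ik*alpha}; only the trivial
# irrep (k = 0) has a non-vanishing average.
means = {k: np.exp(1j * k * alpha).mean() for k in range(4)}
for k, m in means.items():
    print(k, abs(m))
```

Rewired edges therefore contribute nothing in expectation at any non-trivial frequency, so they act as zero-mean noise around the scaled clean matrix.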
Cutoff Parameter :
Ideally, at each frequency , the truncation cutoff , that is, the number of top eigenvectors of matrix that are selected, should be set to include the top eigenvectors that are not largely perturbed by noise and have nontrivial correlation with the eigenvectors of the clean matrix . This value can vary between different frequencies. However, in practice, we set to be a moderate constant for all ’s. In all the trials, we set . In Tab. 2a, we observe that the accuracy first improves as increases, since more information is included. However, the accuracy degrades or saturates when is larger than a certain dependent value due to the effects of noise. This implies that a moderate is needed as a trade-off between the useful information and the impact of noise.
Maximum Frequency :
We test from 2 to 100 and show the results in Tab. 3a. We fix , when varying . In the extremely noisy cases, such as and , the results improve when increases within the range of the values we test. When , the accuracy first increases but then degrades or saturates after a certain dependent value of . This indicates that the optimal choice of depends on the noise level. Under this particular noise perturbation model, the higher the noise level, the larger the that is needed. Also, since the computational complexity of all three proposed affinities increases greatly with , it should be chosen to fit the available computational budget.
B.2 Spectral Clustering
We check the performance of our methods in spectral clustering under different parameter settings. From the clean cluster graph, we apply the random rewiring perturbation as described above.
Cutoff Parameter :
In the clean case, the number of non-zero eigenvalues of the weight matrices is for clusters. Therefore, each has a low-rank structure, and so does the normalized Hermitian matrix . A truncation at the top eigenvectors (i.e. ) is then enough for clustering. In the noisy case, following the model in (15), we are still able to use the top eigenvectors for clustering as long as the signal-to-noise ratio is not too small. Using fewer eigenvectors, as , will lead to loss of information, and using will include spurious information from noise. We conduct the experiments with clusters, where each cluster contains 50 points, and . We set and . We vary the cutoff from 2 to 100 and display the Rand indices of the clustering results from different methods in Tab. 4a. Tab. 4a shows that all of our proposed affinity measures achieve their best performance when and the performance degrades when is too small or too large. We conclude that setting should be a good choice for spectral clustering.
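The low-rank structure of the clean case can be verified on a toy SO(2) example. The sketch assumes two clusters of 50 points, one-dimensional irreps e^{ikα}, within-cluster alignments α_i − α_j, and a weight matrix that includes the diagonal (self-loops with trivial alignment); the last point is an assumption made here so that the rank count is exact.

```python
import numpy as np

rng = np.random.default_rng(0)
K, size, k = 2, 50, 1                         # clusters, cluster size, frequency
n = K * size
labels = np.repeat(np.arange(K), size)
alpha = rng.uniform(0, 2 * np.pi, n)          # one in-plane angle per point

# Clean weight matrix: e^{ik(alpha_i - alpha_j)} inside a cluster, 0 across.
same = labels[:, None] == labels[None, :]
W = same * np.exp(1j * k * (alpha[:, None] - alpha[None, :]))

vals, vecs = np.linalg.eigh(W)
num_nonzero = int(np.sum(np.abs(vals) > 1e-8))

# Projector onto the top-K eigenspace; its entry magnitudes reveal the clusters.
U = vecs[:, -K:]
P = U @ U.conj().T
print(num_nonzero, np.allclose(np.abs(P) * size, same))
```

Exactly `K` eigenvalues are non-zero, and the magnitude of the projector onto their eigenvectors is constant within clusters and zero across them, which is why truncating at the number of clusters suffices when there is no noise.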
Maximum Frequency :
We run another experiment with clusters, , and , with . Each cluster contains 50 points. We vary from 2 to 100 and show the Rand indices of clustering results in Tab. 5a. We observe that the accuracy improves with increasing for all three proposed affinities. However, using a larger increases the computational complexity of all three affinity measures, and the dimension of the irrep might increase with (e.g. the dimension of the Wigner D-matrix at index is ), which is undesirable. There is a trade-off between the statistical accuracy and computational complexity. Therefore, we use a moderate in the main paper.
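To give a rough sense of this growth, assume the irreps are indexed by $k = 1, \dots, k_{\max}$ (notation chosen here for illustration), with dimension $1$ for SO(2) and the standard Wigner D-matrix dimension $2k+1$ for SO(3). The total block size per node across all frequencies is then:

```latex
\text{SO(2):}\quad \sum_{k=1}^{k_{\max}} 1 = k_{\max},
\qquad
\text{SO(3):}\quad \sum_{k=1}^{k_{\max}} (2k + 1)
  = k_{\max}(k_{\max} + 1) + k_{\max}
  = k_{\max}(k_{\max} + 2).
```

For instance, $k_{\max} = 10$ gives a total dimension of $10$ for SO(2) and $120$ for SO(3), while $k_{\max} = 100$ gives $100$ and $10200$ respectively, which illustrates why a moderate maximum frequency is preferable for SO(3).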
References
- [1] Andrey V. Alekseevsky and Dmitry V. Alekseevsky. Riemannian $G$-manifold with one-dimensional orbit space. Annals of Global Analysis and Geometry, 11(3):197–211, 1993.
- [2] Derek Bean, Peter J. Bickel, Noureddine El Karoui, and Bin Yu. Optimal M-estimation in high-dimensional regression. Proceedings of the National Academy of Sciences, 110(36):14563–14568, 2013.
- [3] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In NIPS, 2002.
- [4] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 2003.
- [5] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7(Nov):2399–2434, 2006.
- [6] Nicole Berline, Ezra Getzler, and Michèle Vergne. Heat Kernels and Dirac Operators (Grundlehren Text Editions). Springer, 1992 edition, 12 2003.
- [7] Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 92–100. ACM, 1998.
- [8] Paul Breiding, Sara Kališnik, Bernd Sturmfels, and Madeleine Weinstein. Learning algebraic varieties from samples. Revista Matemática Complutense, 31(3):545–593, 2018.
