# Finding a latent k-simplex in O(k . nnz(data)) time via Subset Smoothing

**Authors:** Chiranjib Bhattacharyya, Ravindran Kannan

arXiv: 1904.06738 · 2020-01-07

## TL;DR

This paper gives a geometric algorithm that recovers a latent k-simplex from perturbed data points in $O^*(k \cdot \text{nnz})$ time, and applies it to Mixed Membership Stochastic Block Models, topic models, and adversarial clustering.

## Contribution

It casts several latent variable models as a single geometric problem, finding a latent k-polytope from perturbed data points, and solves that problem with an algorithm built on a new construction, the subset-smoothed polytope.
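To make the construction concrete, the subset-smoothed polytope can be written out as follows. The notation is an assumption made here for illustration: $A_1, \dots, A_n \in \mathbf{R}^d$ denote the data points, and $\delta n$ is taken to be an integer.

$$
K' = \operatorname{conv}\left\{ \frac{1}{\delta n} \sum_{j \in S} A_j \;:\; S \subseteq \{1, \dots, n\},\ |S| = \delta n \right\}
$$

Closeness of $K$ and $K'$ is measured in Hausdorff distance, whose standard definition for two sets is

$$
d_H(K, K') = \max\left\{ \sup_{x \in K} \inf_{y \in K'} \lVert x - y \rVert,\ \sup_{y \in K'} \inf_{x \in K} \lVert x - y \rVert \right\}.
$$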

## Key findings

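- Runs in O*(k * nnz) time, matching the best known running times in the special cases considered and improving on them when the data is sparse.
- First quasi-input-sparsity time algorithm for parameter estimation in MMSB models and topic models when k is in O*(1).
- Estimates cluster centers well in k-means even when an adversary moves many points of each cluster toward the convex hull of the other clusters' centers.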

## Abstract

In this paper we show that a large class of latent variable models, such as Mixed Membership Stochastic Block Models (MMSB), Topic Models, and Adversarial Clustering, can be unified through a geometric perspective, replacing model-specific assumptions and algorithms for individual models. The geometric perspective leads to the formulation: *find a latent $k$-polytope $K$ in $\mathbf{R}^d$ given $n$ data points, each obtained by perturbing a latent point in $K$*. This problem does not seem to have been considered in the literature. The most important contribution of this paper is to show that the latent $k$-polytope problem admits an efficient algorithm under deterministic assumptions which naturally hold in the latent variable models considered in this paper. Our algorithm runs in time $O^*(k\,\text{nnz})$, matching the best running time of algorithms in the special cases considered here, and is better when the data is sparse, as is the case in applications. An important novelty of the algorithm is the introduction of the *subset smoothed polytope* $K'$, the convex hull of the $\binom{n}{\delta n}$ points obtained by averaging each subset of $\delta n$ data points, for a given $\delta \in (0,1)$. We show that $K$ and $K'$ are close in Hausdorff distance. Among the consequences of our algorithm are the following: (a) MMSB Models and Topic Models: the first quasi-input-sparsity time algorithm for parameter estimation for $k \in O^*(1)$; (b) Adversarial Clustering: in $k$-means, if an adversary is allowed to move many data points from each cluster an arbitrary amount towards the convex hull of the centers of other clusters, our algorithm still estimates cluster centers well.
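Although $K'$ is the hull of $\binom{n}{\delta n}$ averaged points, maximizing a linear function over it never requires enumerating them: the maximum of $u \cdot x$ over $K'$ is attained at the average of the $\delta n$ data points with the largest values of $u \cdot A_j$. The sketch below illustrates only this elementary fact; it is not the paper's algorithm. The function names, the dense list representation, and the rounding of $\delta n$ are assumptions made for illustration.

```python
from itertools import combinations
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def smoothed_support_point(
    points: Sequence[Sequence[float]],
    direction: Sequence[float],
    delta: float,
) -> List[float]:
    """Return the point of the subset-smoothed polytope K' that is
    farthest along `direction`.

    Assumption (illustrative): the subset size is int(delta * n),
    with a minimum of 1.
    """
    n = len(points)
    m = max(1, int(delta * n))
    # One dot product per data point; with sparse storage this pass
    # would cost time proportional to the number of nonzeros.
    scores = [dot(p, direction) for p in points]
    # Indices of the m points with the largest scores.
    top = sorted(range(n), key=lambda j: scores[j], reverse=True)[:m]
    d = len(direction)
    return [sum(points[j][i] for j in top) / m for i in range(d)]


def brute_force_support_value(
    points: Sequence[Sequence[float]],
    direction: Sequence[float],
    delta: float,
) -> float:
    """Check by enumeration: best average score over all m-subsets.

    Exponential in n; only for verifying tiny examples.
    """
    n = len(points)
    m = max(1, int(delta * n))
    scores = [dot(p, direction) for p in points]
    return max(sum(scores[j] for j in s) / m
               for s in combinations(range(n), m))


if __name__ == "__main__":
    pts = [[0, 0], [4, 0], [0, 2], [4, 2]]
    u = [1, 0]
    v = smoothed_support_point(pts, u, 0.5)
    print(v, dot(v, u), brute_force_support_value(pts, u, 0.5))
```

The top-$\delta n$ selection here uses a full sort for brevity; a linear-time selection would serve equally well, since only the set of the $\delta n$ largest scores matters.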

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1904.06738/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/1904.06738/full.md

## References

24 references — full list in the complete paper: https://tomesphere.com/paper/1904.06738/full.md

---
Source: https://tomesphere.com/paper/1904.06738