# Inference of compressed Potts graphical models

**Authors:** Francesca Rizzato, Alice Coucke (LPENS (UMR\_8023)), Eleonora de Leonardis (LPENS (UMR\_8023)), J. P. Barton (DCE-MIT), Jérôme Tubiana (LPENS (UMR\_8023)), Remi Monasson (LPENS (UMR\_8023)), Simona Cocco (LPENS (UMR\_8023))

arXiv: 1907.12793 · 2020-01-29

## TL;DR

This paper introduces a regularization approach for inferring Potts graphical models that reduces complexity by grouping less frequent states, improving computational efficiency without sacrificing accuracy for high-frequency symbols.

## Contribution

The study proposes a double regularization scheme combining color compression and sparsity, validated with two inference algorithms on synthetic and biological data.

## Key findings

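- Color compression does not degrade the reconstruction of parameters for high-frequency symbols.
- Compression drastically reduces the number of remaining parameters and, with it, the computational time.
- The procedure gives similar results on multi-sequence alignments of protein families.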

## Abstract

We consider the problem of inferring a graphical Potts model on a population of variables, with a non-uniform number of Potts colors (symbols) across variables. This inverse Potts problem generally involves the inference of a large number of parameters, often larger than the number of available data, and hence requires the introduction of regularization. We study here a double regularization scheme, in which the number of colors available to each variable is reduced and interaction networks are made sparse. To achieve this color compression, only Potts states with large empirical frequency (exceeding some threshold) are explicitly modeled on each site, while the others are grouped into a single state. We benchmark the performance of this mixed regularization approach with two inference algorithms, the Adaptive Cluster Expansion (ACE) and PseudoLikelihood Maximization (PLM), on synthetic data obtained by sampling disordered Potts models on Erdős-Rényi random graphs. We show in particular that color compression does not affect the quality of reconstruction of the parameters corresponding to high-frequency symbols, while drastically reducing the number of the other parameters and thus the computational time. Our procedure is also applied to multi-sequence alignments of protein families, with similar results.
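The color compression step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `compress_colors` and `n_colors`, the `(samples, sites)` integer array layout, and the strict `>` comparison against `threshold` are all assumptions made for illustration, and the paper's exact thresholding criterion may differ.

```python
import numpy as np

def compress_colors(msa, q, threshold):
    """Group low-frequency states into one state, site by site.

    msa: (B, N) integer array with entries in 0..q-1 (assumed layout).
    Returns the compressed array and, for each site, the kept states.
    """
    B, N = msa.shape
    compressed = np.empty_like(msa)
    kept_states = []
    for i in range(N):
        # empirical frequency of each of the q states on site i
        freqs = np.bincount(msa[:, i], minlength=q) / B
        kept = np.flatnonzero(freqs > threshold)
        # kept states get indices 0..len(kept)-1;
        # every other state shares the single index len(kept)
        lookup = np.full(q, len(kept), dtype=msa.dtype)
        lookup[kept] = np.arange(len(kept))
        compressed[:, i] = lookup[msa[:, i]]
        kept_states.append(kept)
    return compressed, kept_states

def n_colors(kept_states, q):
    # one extra color for the grouped state, unless all q states were kept
    return [len(k) + (1 if len(k) < q else 0) for k in kept_states]
```

For example, calling `compress_colors(msa, q=21, threshold=0.05)` on a protein alignment encoded with 21 symbols would keep, on each site, only the symbols seen in more than 5% of sequences; both values here are illustrative choices, not taken from the paper. The resulting number of colors then varies across sites, which is the non-uniform setting the abstract refers to.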

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1907.12793/full.md

## Figures

64 figures with captions in the complete paper: https://tomesphere.com/paper/1907.12793/full.md

## References

58 references — full list in the complete paper: https://tomesphere.com/paper/1907.12793/full.md

---
Source: https://tomesphere.com/paper/1907.12793