Statistical data mining for symbol associations in genomic databases
Bernard Ycart, Frédéric Pont, Jean-Jacques Fournié

TL;DR
This paper introduces a statistical methodology to identify significant symbol associations in genomic databases. Graph analysis of pairwise symbol associations recovers known gene or protein interactions and proposes novel candidates that still need biological confirmation.
Contribution
A new statistical test and graph-based approach for detecting significant symbol associations in genomic data, applicable to any geneset database and able to propose candidate gene or protein interactions that were not previously known.
Findings
Recovered already known interactions in the MSigDB C2 database and in non-specific selections of it
Detected many previously unknown symbol associations in more specific selections of C2
Method is applicable to any genomic database
Abstract
A methodology is proposed to automatically detect significant symbol associations in genomic databases. A new statistical test is proposed to assess the significance of a group of symbols when found in several genesets of a given database. Applied to symbol pairs, the thresholded p-values of the test define a graph structure on the set of symbols. The cliques of that graph are significant symbol associations, linked to a set of genesets where they can be found. The method can be applied to any database, and is illustrated on the MSigDB C2 database. Many of the symbol associations detected in C2 or in non-specific selections corresponded to already known interactions. On more specific selections of C2, many previously unknown symbol associations have been detected. These associations unveil new candidates for gene or protein interactions, needing further investigation for biological evidence.
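The pipeline outlined in the abstract (pairwise test, thresholded p-values, graph, cliques, supporting genesets) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not specify the authors' test, so a standard upper-tail hypergeometric co-occurrence test is swapped in as a stand-in. The function names, the toy genesets, and the threshold `alpha=0.01` are all hypothetical.

```python
from itertools import combinations
from math import comb


def cooccurrence_pvalue(n_sets, n_a, n_b, n_both):
    """Stand-in test (NOT the paper's): upper-tail hypergeometric p-value.

    Probability that two symbols present in n_a and n_b genesets out of
    n_sets share at least n_both genesets when placed independently at random.
    """
    upper = min(n_a, n_b)
    tail = sum(comb(n_a, k) * comb(n_sets - n_a, n_b - k)
               for k in range(n_both, upper + 1))
    return tail / comb(n_sets, n_b)


def association_graph(genesets, alpha):
    """Link two symbols when their pairwise p-value is at most alpha."""
    symbols = sorted(set().union(*genesets.values()))
    membership = {s: {name for name, g in genesets.items() if s in g}
                  for s in symbols}
    graph = {s: set() for s in symbols}
    for a, b in combinations(symbols, 2):
        shared = len(membership[a] & membership[b])
        p = cooccurrence_pvalue(len(genesets), len(membership[a]),
                                len(membership[b]), shared)
        if p <= alpha:
            graph[a].add(b)
            graph[b].add(a)
    return graph, membership


def maximal_cliques(graph):
    """Basic Bron-Kerbosch enumeration; cliques of size 1 are dropped."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            if len(r) > 1:
                cliques.append(sorted(r))
            return
        for v in list(p):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(graph), set())
    return sorted(cliques)


def supporting_genesets(clique, membership):
    """Genesets that contain every symbol of the clique."""
    return sorted(set.intersection(*(membership[s] for s in clique)))


# Hypothetical toy database: A, B, C always appear together in 3 of 12 genesets.
genesets = {f"set{i}": {"A", "B", "C"} for i in range(1, 4)}
genesets.update({f"set{i}": {"D", f"U{i}"} for i in range(4, 13)})

graph, membership = association_graph(genesets, alpha=0.01)
cliques = maximal_cliques(graph)
for clique in cliques:
    print(clique, supporting_genesets(clique, membership))
# prints: ['A', 'B', 'C'] ['set1', 'set2', 'set3']
```

On this toy input each pair among A, B, C has p-value 1/220, below the threshold, so the three symbols form a single clique supported by the three genesets that contain them. On a real database the number of symbol pairs is large, so the threshold would need to account for multiple testing; how the authors handle this is not stated in the abstract.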
