TL;DR
This paper introduces the Indra Representation Hypothesis, proposing a relational structure-based representation that improves robustness and alignment across unimodal foundation models in vision, language, and audio.
Contribution
It formalizes the Indra representation using category theory, demonstrating its effectiveness for training-free cross-modal and cross-architecture alignment.
Findings
Indra representations improve robustness across unimodal foundation models in vision, language, and audio.
The approach is grounded in category theory and improves cross-modal and cross-architecture alignment without additional training.
Experiments show consistent performance gains across models, architectures, and modalities.
Abstract
Recent studies have uncovered an interesting phenomenon: unimodal foundation models tend to learn convergent representations, regardless of differences in architecture, training objectives, or data modalities. However, these representations are essentially internal abstractions of samples that characterize samples independently, leading to limited expressiveness. In this paper, we propose The Indra Representation Hypothesis, inspired by the philosophical metaphor of Indra's Net. We argue that representations from unimodal foundation models are converging to implicitly reflect a shared relational structure underlying reality, akin to the relational ontology of Indra's Net. We formalize this hypothesis using the V-enriched Yoneda embedding from category theory, defining the Indra representation as a relational profile of each sample with respect to others. This formulation is shown to be…
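The "relational profile" idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the choice of cosine distance as the pairwise measure, the function name relational_profile, and the synthetic embeddings are all assumptions made for illustration. The sketch represents each sample by its vector of distances to every sample in a set, then checks that two embedding spaces differing only by a rotation give the same profiles.

```python
import numpy as np


def relational_profile(embeddings: np.ndarray) -> np.ndarray:
    """Return an (n, n) matrix; row i is sample i's cosine distance to every sample.

    Hypothetical helper: cosine distance is an assumed stand-in for whatever
    pairwise measure the paper actually uses.
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero vectors
    similarity = np.clip(unit @ unit.T, -1.0, 1.0)   # keep values in [-1, 1]
    return 1.0 - similarity


rng = np.random.default_rng(0)

# Synthetic embeddings standing in for "model A" (5 samples, 8 dimensions).
emb_a = rng.normal(size=(5, 8))

# "Model B" is an assumed toy case: the same samples under a random
# orthogonal transform, so raw coordinates differ but relations do not.
rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))
emb_b = emb_a @ rotation

profile_a = relational_profile(emb_a)
profile_b = relational_profile(emb_b)

# Raw embeddings disagree coordinate by coordinate, while profiles coincide.
raw_match = np.allclose(emb_a, emb_b)
profile_match = np.allclose(profile_a, profile_b)
print("raw embeddings match:", raw_match)
print("relational profiles match:", profile_match)
```

Each profile has one entry per reference sample rather than one per embedding dimension, so in this toy setting profiles from models with different embedding widths would share a common shape and could be compared directly, with no extra training step.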
