TL;DR
This paper introduces KrossFuse, a novel kernel fusion method using Kronecker products to combine embeddings from different models, improving multi-modal data representation and bridging the gap between cross-modal and unimodal embeddings.
Contribution
The paper proposes a principled kernel multiplication approach for embedding fusion, along with a scalable approximation method, enhancing multi-modal and unimodal embedding integration.
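The summary does not say how the scalable approximation works, so the following is only a minimal sketch of one generic way to avoid materializing a full Kronecker product: independent Gaussian random projections of each embedding, multiplied elementwise. It is not taken from the paper and should not be read as the authors' algorithm. The function name `rp_kron_features`, the parameter `out_dim`, and the toy data are all hypothetical.

```python
import numpy as np

def rp_kron_features(a, b, out_dim, seed=0):
    """Random-projection stand-in for row-wise Kronecker features.

    Assumption (not from the paper): each embedding is projected with its own
    Gaussian matrix and the projections are multiplied elementwise. In
    expectation, inner products of the result equal <a, a'> * <b, b'>.
    """
    rng = np.random.default_rng(seed)
    r_a = rng.normal(size=(a.shape[1], out_dim))
    r_b = rng.normal(size=(b.shape[1], out_dim))
    return (a @ r_a) * (b @ r_b) / np.sqrt(out_dim)

# Toy unit-norm embeddings from two hypothetical models.
rng = np.random.default_rng(1)
x_a = rng.normal(size=(6, 8))
x_b = rng.normal(size=(6, 5))
x_a /= np.linalg.norm(x_a, axis=1, keepdims=True)
x_b /= np.linalg.norm(x_b, axis=1, keepdims=True)

# Exact product kernel versus its random-projection estimate.
exact = (x_a @ x_a.T) * (x_b @ x_b.T)
z = rp_kron_features(x_a, x_b, out_dim=20000)
approx = z @ z.T
max_err = np.abs(exact - approx).max()
```

The fused feature width is `out_dim` rather than the product of the two embedding widths, and the estimate tightens as `out_dim` grows.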
Findings
RP-KrossFuse effectively combines models, improving performance.
Fusion preserves cross-modal alignment while enhancing modality-specific accuracy.
The approach bridges the gap between cross-modal and unimodal embeddings.
Abstract
State-of-the-art embeddings often capture distinct yet complementary discriminative features: For instance, one image embedding model may excel at distinguishing fine-grained textures, while another focuses on object-level structure. Motivated by this observation, we propose a principled approach to fuse such complementary representations through kernel multiplication. Multiplying the kernel similarity functions of two embeddings allows their discriminative structures to interact, producing a fused representation whose kernel encodes the union of the clusters identified by each parent embedding. This formulation also provides a natural way to construct joint kernels for paired multi-modal data (e.g., image-text tuples), where the product of modality-specific kernels inherits structure from both domains. We highlight that this kernel product is mathematically realized via the Kronecker…
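The link between kernel multiplication and the Kronecker product can be checked numerically. The sketch below assumes plain linear (inner-product) kernels and small random embeddings. The helper `kron_features` and the variable names are illustrative and do not come from the paper.

```python
import numpy as np

def kron_features(a, b):
    """Row-wise Kronecker product: row i equals np.kron(a[i], b[i])."""
    return np.einsum("ni,nj->nij", a, b).reshape(a.shape[0], -1)

# Toy embeddings of the same 5 samples from two hypothetical models.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(5, 4))
emb_b = rng.normal(size=(5, 3))

# Linear kernel of each embedding, then their elementwise product.
k_a = emb_a @ emb_a.T
k_b = emb_b @ emb_b.T
k_prod = k_a * k_b

# Linear kernel of the Kronecker-fused features.
fused = kron_features(emb_a, emb_b)
k_fused = fused @ fused.T

print(np.allclose(k_prod, k_fused))
```

The two kernel matrices agree because the inner product of two Kronecker products factorizes into the product of the individual inner products. The fused dimension here is 4 × 3 = 12, which shows how quickly the exact construction grows with embedding size.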
Taxonomy
MethodsFocus · Contrastive Language-Image Pre-training · BLIP: Bootstrapping Language-Image Pre-training
