Semantic Residual for Multimodal Unified Discrete Representation

Hai Huang; Shulei Wang; Yan Xia

arXiv:2412.19128·cs.CV·December 30, 2024

Open Access

TL;DR

This paper introduces SRCID, a novel framework for multimodal unified representations that employs semantic residual-based disentanglement, significantly improving cross-modal generalization and zero-shot retrieval over existing models.

Contribution

The work proposes a new quantization framework, SRCID, that leverages semantic residuals for better multimodal data representation and handling modality discrepancies.

Findings

01  Outperforms state-of-the-art models in cross-modal tasks

02  Enhances zero-shot retrieval capabilities

03  Demonstrates superior generalization across modalities

Abstract

Recent research on multimodal unified representations predominantly uses a codebook as the representation form and Vector Quantization (VQ) as the quantization method, while other quantization representation forms remain insufficiently explored. Our work explores more precise quantization methods and introduces a new framework, Semantic Residual Cross-modal Information Disentanglement (SRCID), inspired by the numerical residual concept inherent to Residual Vector Quantization (RVQ). SRCID employs semantic residual-based information disentanglement for multimodal data to better handle the inherent discrepancies between different modalities. Our method enhances the capabilities of unified multimodal representations and demonstrates exceptional performance in cross-modal generalization and cross-modal zero-shot retrieval. Its average results significantly surpass existing…
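
The numerical residual idea behind Residual Vector Quantization (RVQ), which the abstract cites as the inspiration for SRCID, can be illustrated with a minimal sketch. This shows plain RVQ only, not SRCID's semantic residual; the toy codebooks, the function names (`nearest`, `rvq_encode`, `rvq_decode`), and the input vector are assumptions made for illustration and do not come from the paper.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def nearest(codebook: Sequence[Vector], x: Vector) -> int:
    """Return the index of the codeword closest to x (squared Euclidean distance)."""
    def sqdist(c: Vector) -> float:
        return sum((ci - xi) ** 2 for ci, xi in zip(c, x))
    return min(range(len(codebook)), key=lambda i: sqdist(codebook[i]))


def rvq_encode(x: Vector, codebooks: Sequence[Sequence[Vector]]) -> Tuple[List[int], Vector]:
    """Quantize x stage by stage; each stage quantizes the residual left by the previous one."""
    residual = list(x)
    codes: List[int] = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen codeword; what remains goes to the next stage.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, residual


def rvq_decode(codes: Sequence[int], codebooks: Sequence[Sequence[Vector]]) -> Vector:
    """Reconstruct by summing the selected codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical toy codebooks: a coarse first stage and a finer second stage.
coarse = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
fine = [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]]

codes, residual = rvq_encode([1.2, 0.8], [coarse, fine])
recon = rvq_decode(codes, [coarse, fine])
print(codes, recon)
```

Each stage encodes only what the earlier stages failed to capture, so adding stages refines the reconstruction numerically. Per the abstract, SRCID takes this residual concept as inspiration but applies it to semantic information across modalities; that step is not reproduced here.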

Taxonomy

Topics: Semantic Web and Ontologies · Advanced Computational Techniques and Applications · Rough Sets and Fuzzy Logic