Semantic Residual for Multimodal Unified Discrete Representation
Hai Huang, Shulei Wang, Yan Xia

TL;DR
This paper introduces SRCID, a novel framework for multimodal unified representations that employs semantic residual-based disentanglement, significantly improving cross-modal generalization and zero-shot retrieval over existing models.
Contribution
The work proposes SRCID, a new quantization framework that uses semantic residuals to represent multimodal data more precisely and to handle discrepancies between modalities.
Findings
Outperforms state-of-the-art models in cross-modal tasks
Enhances zero-shot retrieval capabilities
Demonstrates superior generalization across modalities
Abstract
Recent research on multimodal unified representations predominantly uses codebooks as the representation form, with Vector Quantization (VQ) as the quantizer, yet other quantization-based representation forms remain insufficiently explored. Our work explores more precise quantization methods and introduces a new framework, Semantic Residual Cross-modal Information Disentanglement (SRCID), inspired by the numerical residual concept inherent to Residual Vector Quantization (RVQ). SRCID employs semantic residual-based information disentanglement for multimodal data to better handle the inherent discrepancies between different modalities. Our method enhances the capabilities of unified multimodal representations and demonstrates exceptional performance in cross-modal generalization and cross-modal zero-shot retrieval. Its average results significantly surpass existing…
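For readers unfamiliar with the numerical residual idea the abstract refers to, the following is a minimal sketch of plain Residual Vector Quantization in pure Python. It is not SRCID and not the authors' code: the abstract describes SRCID's residual as semantic rather than numerical, and its details are not given here. The function names (`nearest`, `rvq_encode`, `rvq_decode`) and the toy codebook values are assumptions chosen for illustration only.

```python
def nearest(codebook, x):
    """Return the index of the code vector closest to x (squared Euclidean)."""
    def sq_dist(c):
        return sum((ci - xi) ** 2 for ci, xi in zip(c, x))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def rvq_encode(x, codebooks):
    """Quantize x stage by stage; each stage quantizes what the previous left over."""
    residual = list(x)
    indices = []
    for codebook in codebooks:
        i = nearest(codebook, residual)
        indices.append(i)
        # Subtract the chosen code vector; the remainder goes to the next stage.
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return indices, residual


def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected code vector from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for i, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[i])]
    return out


# Hypothetical hand-picked codebooks: a coarse stage followed by a fine stage.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]

x = [1.3, 0.9]
indices, residual = rvq_encode(x, CODEBOOKS)
recon = rvq_decode(indices, CODEBOOKS)
print(indices, recon, residual)
```

Each stage only has to model the error left by the stages before it, so adding stages refines the reconstruction without enlarging any single codebook. In practice the codebooks would be learned rather than hand-picked.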
Taxonomy
Topics: Semantic Web and Ontologies · Advanced Computational Techniques and Applications · Rough Sets and Fuzzy Logic
