GOMA: Toward Structure-Driven Multimodal Alignment from a Graph Signal Smoothing Perspective
Xu Wang, Xunkai Li, Yinlin Zhu, Rong-Hua Li, Guoren Wang

TL;DR
GOMA is a framework that refines frozen multimodal embeddings with graph signal smoothing to improve retrieval on multimodal attributed graphs (MAGs).
Contribution
GOMA introduces a structure-driven post-alignment method that learns modality-aware propagation, controls the degree of smoothing, and adaptively preserves useful semantic information.
Findings
GOMA achieves state-of-the-art retrieval on seven MAG benchmarks.
GOMA is more stable than previous graph-based methods.
GOMA effectively leverages graph structure as unlabeled context.
Abstract
Multimodal alignment is commonly learned from isolated image-text pairs via CLIP-style dual encoders, leaving the relational context among entities largely unused. Multimodal attributed graphs (MAGs), where nodes carry multimodal attributes and edges encode corpus structure, provide a natural setting for refining frozen vision-language embeddings. This refinement is challenging: visual, textual, and cross-modal relations often induce different neighborhood geometries, while unrestricted graph propagation can quickly over-smooth retrieval representations. Effectively leveraging graph context therefore requires simultaneously breaking modality-specific topological barriers, controlling the smoothing regime, and preserving informative smoothing before semantic boundaries collapse. We propose Graph-Optimized Multimodal Alignment (GOMA), a structure-driven post-alignment framework that views…
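The over-smoothing risk described in the abstract can be illustrated with a minimal sketch of generic graph signal smoothing over frozen embeddings. This is not GOMA's method: it is a standard residual propagation scheme (in the style of personalized PageRank), and the function names, the `steps` and `alpha` parameters, and the toy path graph are all assumptions made for illustration.

```python
import numpy as np

def normalized_adjacency(edges, num_nodes):
    """Build D^{-1/2} (A + I) D^{-1/2} for an undirected graph (hypothetical helper)."""
    adj = np.eye(num_nodes)  # self-loops keep every degree positive
    for u, v in edges:
        adj[u, v] = 1.0
        adj[v, u] = 1.0
    inv_sqrt = 1.0 / np.sqrt(adj.sum(axis=1))
    return adj * inv_sqrt[:, None] * inv_sqrt[None, :]

def smooth_embeddings(x, a_hat, steps=2, alpha=0.5):
    """Repeat h <- (1 - alpha) * a_hat @ h + alpha * x, then L2-normalize each row.

    alpha = 1 returns the (normalized) frozen embeddings unchanged;
    alpha = 0 is unrestricted propagation, which over-smooths as steps grows.
    """
    h = x.copy()
    for _ in range(steps):
        h = (1.0 - alpha) * (a_hat @ h) + alpha * x
    norms = np.linalg.norm(h, axis=1, keepdims=True)
    return h / np.clip(norms, 1e-12, None)

# Toy example: a 4-node path graph with random stand-ins for frozen embeddings.
edges = [(0, 1), (1, 2), (2, 3)]
a_hat = normalized_adjacency(edges, num_nodes=4)
x = np.random.default_rng(0).normal(size=(4, 3))

controlled = smooth_embeddings(x, a_hat, steps=2, alpha=0.5)
collapsed = smooth_embeddings(x, a_hat, steps=200, alpha=0.0)

# Pairwise cosine similarities: the collapsed embeddings become indistinguishable.
print(np.round(controlled @ controlled.T, 3))
print(np.round(collapsed @ collapsed.T, 3))
```

In this sketch, the residual weight `alpha` is the single knob that keeps propagation from erasing the differences between nodes, which is the kind of control a retrieval setting needs.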
