GOMA: Toward Structure-Driven Multimodal Alignment from a Graph Signal Smoothing Perspective

Xu Wang; Xunkai Li; Yinlin Zhu; Rong-Hua Li; Guoren Wang

arXiv:2605.15723 · cs.LG · May 18, 2026

TL;DR

GOMA is a post-alignment framework that refines frozen vision-language embeddings with graph signal smoothing to improve retrieval on multimodal attributed graphs (MAGs).

Contribution

GOMA introduces a structure-driven post-alignment method that learns modality-aware propagation, controls the smoothing regime, and adaptively preserves useful semantic information.

Findings

01. GOMA achieves state-of-the-art retrieval on seven MAG benchmarks.

02. GOMA is more stable than previous graph-based methods.

03. GOMA effectively leverages graph structure as unlabeled context.

Abstract

Multimodal alignment is commonly learned from isolated image-text pairs via CLIP-style dual encoders, leaving the relational context among entities largely unused. Multimodal attributed graphs (MAGs), where nodes carry multimodal attributes and edges encode corpus structure, provide a natural setting for refining frozen vision-language embeddings. This refinement is challenging: visual, textual, and cross-modal relations often induce different neighborhood geometries, while unrestricted graph propagation can quickly over-smooth retrieval representations. Effectively leveraging graph context therefore requires simultaneously breaking modality-specific topological barriers, controlling the smoothing regime, and preserving informative smoothing before semantic boundaries collapse. We propose Graph-Optimized Multimodal Alignment (GOMA), a structure-driven post-alignment framework that views…
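The over-smoothing risk mentioned in the abstract can be illustrated with a minimal sketch of generic graph signal smoothing. This is not GOMA's method, which the truncated abstract does not specify. It assumes a toy ring graph, random stand-in embeddings, the common symmetric normalization with self-loops, and a hypothetical residual weight `alpha`; all function names are illustrative.

```python
import numpy as np


def normalized_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2}, a standard low-pass graph operator."""
    a_hat = adj + np.eye(adj.shape[0])  # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def smooth(x, adj, steps, alpha=0.0):
    """Propagate embeddings x over the graph for `steps` rounds.

    alpha is a hypothetical residual weight that pulls each node back toward
    its original (frozen) embedding; alpha=0 is unrestricted propagation.
    """
    s = normalized_adjacency(adj)
    h = x.copy()
    for _ in range(steps):
        h = (1.0 - alpha) * (s @ h) + alpha * x
    return h


def mean_pairwise_cosine(h):
    """Average cosine similarity over all distinct node pairs."""
    hn = h / np.linalg.norm(h, axis=1, keepdims=True)
    sim = hn @ hn.T
    n = h.shape[0]
    return (sim.sum() - n) / (n * (n - 1))


# Toy data (assumed): a 6-node ring graph and random 8-dimensional embeddings.
n = 6
adj = np.zeros((n, n))
for i in range(n):
    adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = 1.0
x = np.random.default_rng(0).normal(size=(n, 8))

unrestricted = mean_pairwise_cosine(smooth(x, adj, steps=50))
controlled = mean_pairwise_cosine(smooth(x, adj, steps=50, alpha=0.5))
print(f"unrestricted: {unrestricted:.3f}, controlled: {controlled:.3f}")
```

With `alpha=0`, repeated propagation drives all node embeddings toward a single direction, so the mean pairwise cosine approaches 1 and nodes become hard to tell apart at retrieval time. A nonzero `alpha` is one simple way to keep part of the original signal; it stands in here for the idea of controlled smoothing, not for the paper's actual mechanism.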
