Topology-Aware Representation Alignment for Semi-Supervised Vision-Language Learning
Junwon You, Mihyun Jang, Sangwoo Mo, Jae-Hun Jung

TL;DR
This paper introduces ToMA, a topology-aware framework that uses persistent homology to align image and text representations in semi-supervised vision-language learning, with the aim of more stable gains and better modeling of global multimodal structure.
Contribution
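It proposes a topology-based alignment method that uses persistent homology to identify topologically salient edges and aligns them across modalities through available image-text correspondences, capturing multimodal structure without requiring complex simplices.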
Findings
ToMA improves performance on remote sensing tasks.
ToMA provides stable gains over existing methods.
Lightweight H_1-birth edges capture useful higher-order structures.
Abstract
Vision-language models have shown strong performance, but they often generalize poorly to specialized domains. While semi-supervised vision-language learning mitigates this limitation by leveraging a small set of labeled image-text pairs together with abundant unlabeled images, existing methods remain fundamentally pairwise and fail to model the global structure of multimodal representation manifolds. Existing topology-based alignment methods rely on persistence diagram matching, which neither guarantees geometric alignment nor utilizes the image-text pairing information central to vision-language learning. We propose Topology-Aware Multimodal Representation Alignment (ToMA), a framework that uses persistent homology to identify topologically salient edges and aligns them across modalities through available cross-modal correspondences. ToMA leverages both H_0-death edges and lightweight…
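As an illustration of the H_0 part of this idea, the sketch below relies on a standard fact: in a Vietoris-Rips filtration, the edges at which H_0 classes die are exactly the edges of a minimum spanning tree of the point cloud. The function names `h0_death_edges` and `edge_alignment_gap`, the squared-length penalty, and the assumption that row i of both embedding arrays belongs to the same image-text pair are hypothetical choices made for this example, not the authors' formulation. H_1-birth edges are not covered.

```python
import numpy as np


def h0_death_edges(points):
    """Edges (i, j, length) at which H_0 classes die in a Vietoris-Rips
    filtration. These coincide with minimum spanning tree edges, found
    here with Kruskal's algorithm and a union-find structure."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    iu, ju = np.triu_indices(n, k=1)
    order = np.argsort(dist[iu, ju], kind="stable")
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = []
    for k in order:
        i, j = int(iu[k]), int(ju[k])
        ri, rj = find(i), find(j)
        if ri != rj:
            # This edge merges two components, so one H_0 class dies here.
            parent[ri] = rj
            edges.append((i, j, float(dist[i, j])))
    return edges


def edge_alignment_gap(img_emb, txt_emb):
    """Hypothetical penalty (an assumption, not the paper's loss): for every
    H_0-death edge found in one modality, compare its length with the
    distance between the same paired indices in the other modality."""
    total, count = 0.0, 0
    for src, dst in ((img_emb, txt_emb), (txt_emb, img_emb)):
        for i, j, length in h0_death_edges(src):
            other = float(np.linalg.norm(dst[i] - dst[j]))
            total += (length - other) ** 2
            count += 1
    return total / max(count, 1)
```

Calling `edge_alignment_gap` on two identical embedding arrays returns zero, and the value grows as the salient edges of one modality are stretched or shrunk relative to the other. In a semi-supervised setting, only the labeled image-text pairs would supply the row correspondence assumed here.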
