StructXLIP: Enhancing Vision-language Models with Multimodal Structural Cues
Zanxi Ruan, Songqun Gao, Qiuyu Kong, Yiming Wang, Marco Cristani

TL;DR
StructXLIP improves vision-language models by incorporating structural cues through edge-based representations and specialized alignment losses, leading to better cross-modal retrieval performance and more robust semantic understanding.
Contribution
The paper introduces a fine-tuning paradigm that uses edge maps and structure-centric losses to strengthen vision-language alignment beyond what the standard alignment loss alone provides.
Findings
Outperforms existing methods on cross-modal retrieval tasks.
Enhances robustness and semantic stability of vision-language models.
Provides a plug-and-play approach for future model improvements.
Abstract
Edge-based representations are fundamental cues for visual understanding, a principle rooted in early vision research and still central today. We extend this principle to vision-language alignment, showing that isolating and aligning structural cues across modalities can greatly benefit fine-tuning on long, detail-rich captions, with a specific focus on improving cross-modal retrieval. We introduce StructXLIP, a fine-tuning alignment paradigm that extracts edge maps (e.g., Canny), treating them as proxies for the visual structure of an image, and filters the corresponding captions to emphasize structural cues, making them "structure-centric". Fine-tuning augments the standard alignment loss with three structure-centric losses: (i) aligning edge maps with structural text, (ii) matching local edge regions to textual chunks, and (iii) connecting edge maps to color images to prevent…
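
The way structure-centric terms add to the standard alignment loss can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: it assumes a CLIP-style symmetric contrastive loss for every term, precomputed embeddings, and hypothetical weights `w_struct` and `w_consistency`. The local region-to-chunk term (ii) is omitted, and all function and parameter names are invented for illustration.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def log_softmax(logits: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(a: np.ndarray, b: np.ndarray,
                               temperature: float = 0.07) -> float:
    """CLIP-style symmetric InfoNCE; row i of `a` is the positive for row i of `b`."""
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature
    a_to_b = -np.mean(np.diag(log_softmax(logits, axis=1)))
    b_to_a = -np.mean(np.diag(log_softmax(logits, axis=0)))
    return float(0.5 * (a_to_b + b_to_a))


def combined_loss(image_emb: np.ndarray, text_emb: np.ndarray,
                  edge_emb: np.ndarray, struct_text_emb: np.ndarray,
                  w_struct: float = 1.0, w_consistency: float = 1.0) -> float:
    """Standard alignment loss plus two global structure-centric terms.

    Assumption: each term is a symmetric contrastive loss and the terms are
    combined by a weighted sum; the paper's exact formulation may differ.
    """
    base = symmetric_contrastive_loss(image_emb, text_emb)            # image <-> full caption
    structure = symmetric_contrastive_loss(edge_emb, struct_text_emb)  # (i) edge map <-> structural text
    consistency = symmetric_contrastive_loss(edge_emb, image_emb)      # (iii) edge map <-> color image
    return base + w_struct * structure + w_consistency * consistency


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch, dim = 8, 32
    # Random stand-ins for encoder outputs; real embeddings would come from the model.
    image, text, edge, struct_text = (rng.normal(size=(batch, dim)) for _ in range(4))
    print(combined_loss(image, text, edge, struct_text))
```

Setting both weights to zero reduces the sketch to the standard alignment loss, which makes it easy to compare against a baseline under the same assumptions.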
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling
