Loading paper
StructXLIP: Enhancing Vision-language Models with Multimodal Structural Cues | Tomesphere