Dynamic Cross-Modal Alignment for Robust Semantic Location Prediction
Liu Jing, Amirul Rahman

TL;DR
This paper presents CoVLA, a novel framework that improves semantic location prediction from social media posts by effectively aligning and integrating visual and textual data, outperforming existing methods in accuracy and robustness.
Contribution
Introduction of CoVLA, a discriminative framework with novel modules for cross-modal alignment and fusion, enhancing semantic location prediction accuracy and robustness.
Findings
Achieves 2.3% higher accuracy and 2.5% higher F1-score than state-of-the-art methods on a benchmark dataset.
Demonstrates robustness under noisy data conditions.
Validates effectiveness through extensive experiments and human evaluations.
Abstract
Semantic location prediction from multimodal social media posts is a critical task with applications in personalized services and human mobility analysis. This paper introduces Contextualized Vision-Language Alignment (CoVLA), a discriminative framework designed to address the challenges of contextual ambiguity and modality discrepancy inherent in this task. CoVLA leverages a Contextual Alignment Module (CAM) to enhance cross-modal feature alignment and a Cross-modal Fusion Module (CMF) to dynamically integrate textual and visual information. Extensive experiments on a benchmark dataset demonstrate that CoVLA significantly outperforms state-of-the-art methods, achieving improvements of 2.3% in accuracy and 2.5% in F1-score. Ablation studies validate the contributions of CAM and CMF, while human evaluations highlight the contextual relevance of the predictions. Additionally,…
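The abstract names two ideas, cross-modal alignment and dynamic fusion, without giving their formulation here. The sketch below is a generic illustration of those two ideas only, not the paper's CAM or CMF: it assumes alignment can be scored with cosine similarity and fusion can be done with a single scalar sigmoid gate. The function names `cosine_similarity` and `gated_fusion`, the gate parameterization, and the toy vectors are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def sigmoid(z):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def gated_fusion(text_vec, image_vec, gate_weights, gate_bias=0.0):
    """Blend text and image features with a scalar gate (an assumed, generic scheme).

    gate_weights must have length len(text_vec) + len(image_vec); the gate is
    computed from the concatenated features, so the mix depends on the input.
    Returns (fused_vector, gate_value).
    """
    if len(text_vec) != len(image_vec):
        raise ValueError("text and image vectors must share a dimension")
    concat = list(text_vec) + list(image_vec)
    if len(gate_weights) != len(concat):
        raise ValueError("gate_weights must match the concatenated length")
    z = sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias
    g = sigmoid(z)
    # g near 1 favors the text features, g near 0 favors the image features.
    fused = [g * t + (1.0 - g) * v for t, v in zip(text_vec, image_vec)]
    return fused, g


if __name__ == "__main__":
    text = [1.0, 0.0]   # toy text embedding (hypothetical)
    image = [0.0, 1.0]  # toy image embedding (hypothetical)
    print("alignment score:", cosine_similarity(text, image))
    fused, gate = gated_fusion(text, image, gate_weights=[0.0] * 4)
    print("gate:", gate, "fused:", fused)
```

With all-zero gate weights and bias the gate is exactly 0.5, so the fused vector is the plain average of the two inputs; in a trained model the gate weights would be learned so that the less reliable modality is down-weighted per post.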
Taxonomy
Methods: Class-activation map
