Dynamic Cross-Modal Alignment for Robust Semantic Location Prediction

Liu Jing; Amirul Rahman

arXiv:2412.09870·cs.CV·December 16, 2024

TL;DR

This paper presents CoVLA, a novel framework that improves semantic location prediction from social media posts by effectively aligning and integrating visual and textual data, outperforming existing methods in accuracy and robustness.

Contribution

Introduces CoVLA, a discriminative framework whose Contextual Alignment Module (CAM) and Cross-modal Fusion Module (CMF) align and fuse visual and textual features, improving the accuracy and robustness of semantic location prediction.

Findings

01 Achieves 2.3% higher accuracy and 2.5% higher F1-score than state-of-the-art methods on a benchmark dataset.

02 Remains robust under noisy data conditions.

03 Supports the design through ablation studies of CAM and CMF and through human evaluations of the contextual relevance of predictions.

Abstract

Semantic location prediction from multimodal social media posts is a critical task with applications in personalized services and human mobility analysis. This paper introduces Contextualized Vision-Language Alignment (CoVLA), a discriminative framework designed to address the challenges of contextual ambiguity and modality discrepancy inherent in this task. CoVLA leverages a Contextual Alignment Module (CAM) to enhance cross-modal feature alignment and a Cross-modal Fusion Module (CMF) to dynamically integrate textual and visual information. Extensive experiments on a benchmark dataset demonstrate that CoVLA significantly outperforms state-of-the-art methods, achieving improvements of 2.3% in accuracy and 2.5% in F1-score. Ablation studies validate the contributions of CAM and CMF, while human evaluations highlight the contextual relevance of the predictions. Additionally,…
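The abstract names the two modules but this page does not give their internals. The sketch below is only an illustration of the general pattern of "align, then dynamically fuse", not the authors' implementation. It assumes scaled dot-product cross-attention as a stand-in for CAM and a per-dimension sigmoid gate as a stand-in for CMF; every function name, dimension, and parameter here is hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_pool(rows):
    """Average a list of equal-length vectors into one vector."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def align(text_tokens, image_patches):
    """Assumed stand-in for CAM: each text token attends over all image patches."""
    scale = math.sqrt(len(text_tokens[0]))
    aligned = []
    for tok in text_tokens:
        weights = softmax([dot(tok, p) / scale for p in image_patches])
        # Attention-weighted sum of patch features for this token.
        aligned.append([sum(w * p[j] for w, p in zip(weights, image_patches))
                        for j in range(len(tok))])
    return aligned


def gated_fusion(text_vec, visual_vec, gate_w, gate_b):
    """Assumed stand-in for CMF: a sigmoid gate mixes the two modalities per dimension."""
    joint = text_vec + visual_vec  # list concatenation, length 2 * dim
    gate = [1.0 / (1.0 + math.exp(-(dot(row, joint) + b)))
            for row, b in zip(gate_w, gate_b)]
    fused = [g * t + (1.0 - g) * v for g, t, v in zip(gate, text_vec, visual_vec)]
    return fused, gate


def predict(text_tokens, image_patches, params):
    """Align, pool, fuse, then classify into hypothetical location classes."""
    visual_vec = mean_pool(align(text_tokens, image_patches))
    text_vec = mean_pool(text_tokens)
    fused, gate = gated_fusion(text_vec, visual_vec, params["gate_w"], params["gate_b"])
    logits = [dot(row, fused) + b for row, b in zip(params["cls_w"], params["cls_b"])]
    return softmax(logits), gate


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


# Toy, randomly initialised parameters; sizes are arbitrary illustration values.
rng = random.Random(0)
dim, n_classes = 8, 4
params = {
    "gate_w": random_matrix(dim, 2 * dim, rng),
    "gate_b": [0.0] * dim,
    "cls_w": random_matrix(n_classes, dim, rng),
    "cls_b": [0.0] * n_classes,
}
text_tokens = random_matrix(5, dim, rng)     # 5 text-token features
image_patches = random_matrix(9, dim, rng)   # 9 image-patch features
probs, gate = predict(text_tokens, image_patches, params)
```

In this sketch the gate value for each dimension lies strictly between 0 and 1, so the fused vector is a convex mix of the text feature and the text-conditioned visual feature. A learned gate of this kind is one common way to down-weight a noisy modality, which is the sort of behaviour the robustness claim refers to, though the paper's actual mechanism may differ.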

Taxonomy

Methods: Class-activation map