DARC-CLIP: Dynamic Adaptive Refinement with Cross-Attention for Meme Understanding
Qiyuan Jin

TL;DR
DARC-CLIP introduces an adaptive multimodal fusion framework with hierarchical refinement for improved meme understanding and sensitive content detection.
Contribution
It proposes Adaptive Cross-Attention Refiners and Dynamic Feature Adapters for bidirectional signal alignment in CLIP-based models.
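As a rough illustration of what bidirectional cross-attention refinement between two modalities can look like, the sketch below lets text tokens attend over image tokens and vice versa, then adds each result back through a gated residual. It is not the paper's architecture: the names `cross_attend`, `refine`, and `gate` are hypothetical, there are no learned projections, and the fixed scalar gate only stands in for whatever adaptive mechanism the authors use.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Scaled dot-product attention: each query row attends over all context rows.

    Assumption: no learned query/key/value projections, to keep the sketch small.
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) / math.sqrt(dim) for c in context]
        weights = softmax(scores)
        attended.append(
            [sum(w * c[j] for w, c in zip(weights, context)) for j in range(dim)]
        )
    return attended


def refine(text_tokens, image_tokens, gate=0.5):
    """One bidirectional refinement step with a gated residual connection.

    `gate` is a hypothetical fixed scalar; a trained model would learn it.
    """
    text_update = cross_attend(text_tokens, image_tokens)   # text reads from image
    image_update = cross_attend(image_tokens, text_tokens)  # image reads from text
    new_text = [
        [t + gate * u for t, u in zip(row, upd)]
        for row, upd in zip(text_tokens, text_update)
    ]
    new_image = [
        [v + gate * u for v, u in zip(row, upd)]
        for row, upd in zip(image_tokens, image_update)
    ]
    return new_text, new_image


# Toy features: 2 text tokens and 3 image patches, each of dimension 2.
text = [[1.0, 0.0], [0.0, 1.0]]
image = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]

# Stacking two steps loosely mimics a hierarchical refinement stack.
for _ in range(2):
    text, image = refine(text, image)
```

Each step preserves the number of tokens and the feature dimension in both modalities, so steps can be stacked freely; setting `gate=0` reduces a step to the identity.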
Findings
Achieves +4.18 AUROC and +6.84 F1 in hate detection.
Outperforms static fusion baselines on the PrideMM benchmark.
Ablation confirms the Adaptive Cross-Attention Refiners (ACAR) and Dynamic Feature Adapters (DFA) as the key contributors to these gains.
Abstract
Memes convey meaning through the interaction of visual and textual signals, often combining humor, irony, and offense in subtle ways. Detecting harmful or sensitive content in memes requires accurate modeling of these multimodal cues. Existing CLIP-based approaches rely on static fusion, which struggles to capture fine-grained dependencies between modalities. We propose DARC-CLIP, a CLIP-based framework for adaptive multimodal fusion with a hierarchical refinement stack. DARC-CLIP introduces Adaptive Cross-Attention Refiners for bidirectional information alignment and Dynamic Feature Adapters for task-sensitive signal adaptation. We evaluate DARC-CLIP on the PrideMM benchmark, which includes hate, target, stance, and humor classification, and further test generalization on the CrisisHateMM dataset. DARC-CLIP achieves highly competitive classification accuracy across tasks, with…
