TL;DR
This paper introduces a multimodal fusion architecture for hate speech and sentiment detection in Nepali memes, demonstrating improved performance and revealing key challenges in low-resource, script-specific contexts.
Contribution
It proposes a hybrid cross-modal attention model that combines a visual encoder with a multilingual text encoder, and it analyzes model limitations and the effects of data scarcity.
Findings
Explicit cross-modal reasoning improves F1-macro by 5.9% over text-only baselines.
English-centric vision models perform poorly on Devanagari script.
Ensemble methods can degrade under data scarcity due to overfitting.
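For readers unfamiliar with the metric in the first finding, F1-macro is the unweighted mean of per-class F1 scores, so a minority class counts as much as a majority one. A minimal sketch in plain Python follows; the function name `macro_f1` and the toy labels are illustrative assumptions, not taken from the paper.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 (illustrative helper, not the paper's code)."""
    scores = []
    for c in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)  # correct hits for class c
        fp = sum(1 for t, p in pairs if t != c and p == c)  # wrongly predicted as c
        fn = sum(1 for t, p in pairs if t == c and p != c)  # missed instances of c
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); score 0 when the class never appears at all.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy binary example (made-up labels): 1 = hate, 0 = non-hate.
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
score = macro_f1(y_true, y_pred, labels=[0, 1])
```

Here class 0 scores 2/3 and class 1 scores 4/5, so the macro average is 11/15.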
Abstract
Hate speech detection in Devanagari-scripted social media memes presents compounded challenges: multimodal content structure, script-specific linguistic complexity, and extreme data scarcity in low-resource settings. This paper presents our system for the CHiPSAL 2026 shared task, addressing both Subtask A (binary hate speech detection) and Subtask B (three-class sentiment classification: positive, neutral, negative). We propose a hybrid cross-modal attention fusion architecture that combines CLIP (ViT-B/32) for visual encoding with BGE-M3 for multilingual text representation, connected through 4-head self-attention and a learnable gating network that dynamically weights modality contributions on a per-sample basis. Systematic evaluation across eight model configurations demonstrates that explicit cross-modal reasoning achieves a 5.9% F1-macro improvement over text-only baselines on…
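The fusion step described in the abstract can be sketched roughly as follows. This is a NumPy illustration under stated assumptions, not the authors' implementation: random vectors stand in for the CLIP ViT-B/32 image embedding (512-dimensional) and the BGE-M3 dense text embedding (1024-dimensional), and the shared width `d=256`, the softmax form of the gate, the weight initialization, and all parameter names are hypothetical choices. Classification heads are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(rng, d_img=512, d_txt=1024, d=256):
    """Random weights for the sketch; names and sizes are assumptions."""
    s = 0.02
    return {
        "W_img": rng.normal(0, s, (d_img, d)),   # project image embedding to shared width
        "W_txt": rng.normal(0, s, (d_txt, d)),   # project text embedding to shared width
        "W_q": rng.normal(0, s, (d, d)),
        "W_k": rng.normal(0, s, (d, d)),
        "W_v": rng.normal(0, s, (d, d)),
        "W_o": rng.normal(0, s, (d, d)),
        "W_gate": rng.normal(0, s, (2 * d, 2)),  # gating network: one logit per modality
    }


def fuse(img_emb, txt_emb, p, n_heads=4):
    """Self-attention over the two modality tokens, then per-sample gated mixing."""
    batch = img_emb.shape[0]
    # Treat each modality as one token: shape (batch, 2, d).
    tokens = np.stack([img_emb @ p["W_img"], txt_emb @ p["W_txt"]], axis=1)
    d = tokens.shape[-1]
    d_head = d // n_heads

    def split(x):  # (batch, 2, d) -> (batch, heads, 2, d_head)
        return x.reshape(batch, 2, n_heads, d_head).transpose(0, 2, 1, 3)

    q, k, v = (split(tokens @ p[name]) for name in ("W_q", "W_k", "W_v"))
    # Scaled dot-product attention: each token attends to both modalities.
    attn = softmax(q @ k.transpose(0, 1, 3, 2) / np.sqrt(d_head))
    ctx = (attn @ v).transpose(0, 2, 1, 3).reshape(batch, 2, d) @ p["W_o"]

    # Per-sample gate: two weights that sum to 1 (softmax is an assumption).
    gate = softmax(ctx.reshape(batch, 2 * d) @ p["W_gate"])
    fused = gate[:, 0:1] * ctx[:, 0] + gate[:, 1:2] * ctx[:, 1]
    return fused, gate


rng = np.random.default_rng(0)
params = init_params(rng)
img = rng.normal(size=(3, 512))    # stand-in for CLIP image embeddings
txt = rng.normal(size=(3, 1024))   # stand-in for BGE-M3 text embeddings
fused, gate = fuse(img, txt, params)
```

Each row of `gate` holds the image and text weights for one meme, which is one way to realize the per-sample weighting of modality contributions that the abstract describes.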
