GRASP: Grounded CoT Reasoning with Dual-Stage Optimization for Multimodal Sarcasm Target Identification

Faxian Wan; Xiaocui Yang; Yifan Cao; Shi Feng; Daling Wang; Yifei Zhang

arXiv:2604.08879 · cs.CL · April 13, 2026

TL;DR

GRASP introduces a grounded, dual-stage optimized reasoning framework for multimodal sarcasm target identification, enhancing interpretability and localization accuracy over existing methods.

Contribution

GRASP proposes Grounded CoT reasoning with dual-stage optimization and curates MSTI-MAX, a refined dataset that mitigates class imbalance and enriches multimodal sarcasm cues.

Findings

01 Outperforms baselines in multimodal sarcasm target identification

02 Explicit grounding improves interpretability and localization

03 LLM-based evaluation assesses reasoning quality

Abstract

Moving beyond the traditional binary classification paradigm of Multimodal Sarcasm Detection, Multimodal Sarcasm Target Identification (MSTI) presents a more formidable challenge, requiring precise localization of fine-grained targets such as textual phrases and visual regions. Existing approaches predominantly rely on implicit cross-modal alignment, offering limited interpretability and suboptimal fine-grained localization. To address these limitations, we propose GRASP, Grounded Chain-of-Thought ReAsoning with Dual-Stage Optimization for Multimodal Sarcasm Prediction and Target Identification, a framework that integrates visual grounding with explicit Chain-of-Thought (CoT) reasoning to move beyond black-box MSTI. Specifically, we curate MSTI-MAX, a refined dataset that mitigates class imbalance and enriches multimodal sarcasm cues. We introduce Grounded CoT reasoning, which…
