From Perception to Punchline: Empowering VLM with the Art of In-the-wild Meme
Xueyan Li, Yingyi Xue, Mengjie Jiang, Qingzi Zhu, Yazhe Niu

TL;DR
This paper introduces HUMOR, a framework that improves the ability of vision-language models to generate humorous memes through hierarchical reasoning and alignment with human preferences, resulting in more diverse and higher-quality memes.
Contribution
HUMOR employs multi-path Chain-of-Thought reasoning and a group-wise reward model to improve humor quality and subjective preference alignment in multimodal meme generation.
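To make the group-wise idea concrete, here is a minimal sketch of group-relative reward normalization in the style of GRPO, which the peer reviews mention as part of the training pipeline. It assumes each group holds reward-model scores for captions generated on the same meme template; the function name, the example scores, and the `eps` constant are illustrative assumptions, not the paper's implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group (assumed: captions for one template).

    Each reward is centered on the group mean and scaled by the group
    standard deviation, so captions are compared only with their peers.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward-model scores for four captions on a single template.
scores = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(scores)

# A group with identical scores carries no preference signal.
flat = group_relative_advantages([0.5, 0.5, 0.5])
```

Because normalization happens inside each group, a caption is rewarded for being funnier than other captions on the same template rather than for belonging to an easier template.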
Findings
HUMOR achieves superior reasoning diversity in meme generation.
The reward model ensures consistent alignment with human preferences.
HUMOR improves overall meme quality across various models.
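As an illustration of what group-wise preference alignment can look like, the sketch below turns pairwise human judgments into a ranking that is computed separately for each meme template. The win-count scoring rule, the tuple layout, and the template and caption names are assumptions made for this example; the page does not state how the paper's reward model aggregates preferences.

```python
from collections import defaultdict


def rank_within_groups(comparisons):
    """Rank captions per template from (template_id, winner, loser) tuples.

    Captions are ordered by number of pairwise wins inside their own
    template group; ties are broken alphabetically for determinism.
    """
    wins = defaultdict(lambda: defaultdict(int))
    for template_id, winner, loser in comparisons:
        wins[template_id][winner] += 1
        wins[template_id][loser] += 0  # make sure the loser is registered
    return {
        template_id: sorted(counts, key=lambda c: (-counts[c], c))
        for template_id, counts in wins.items()
    }


# Hypothetical judgments: captions are only compared within one template.
judgments = [
    ("drake", "A", "B"),
    ("drake", "A", "C"),
    ("drake", "B", "C"),
    ("cat", "X", "Y"),
]
rankings = rank_within_groups(judgments)
```

No comparison crosses template boundaries, which mirrors the point raised in the reviews that humor is more comparable within a template than across templates.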
Abstract
Generating humorous memes is a challenging multimodal task that moves beyond direct image-to-caption supervision. It requires nuanced reasoning over visual content, contextual cues, and subjective humor. To bridge this gap between visual perception and humorous punchline creation, we propose HUMOR, a novel framework that guides VLMs through hierarchical reasoning and aligns them with group-wise human preferences. First, HUMOR employs a hierarchical, multi-path Chain-of-Thought (CoT): the model begins by identifying a template-level intent, then explores diverse reasoning paths under different contexts, and finally anchors onto a high-quality, context-specific path. This CoT supervision, which traces back from ground-truth captions, enhances reasoning diversity. We further analyze that this multi-path exploration with anchoring maintains a high expected humor quality, under the…
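The explore-then-anchor step described in the abstract can be sketched as follows. This is a rough illustration only: the `ReasoningPath` fields, the numeric `score`, and the example paths are hypothetical, and how the paper actually scores or selects paths is not shown on this page.

```python
from dataclasses import dataclass


@dataclass
class ReasoningPath:
    context: str    # hypothetical context under which the path was explored
    caption: str    # candidate punchline produced by this path
    score: float    # assumed quality score from some external judge


def anchor_path(paths):
    """Select the highest-scoring path among those explored for one template."""
    if not paths:
        raise ValueError("at least one reasoning path is required")
    return max(paths, key=lambda p: p.score)


# Hypothetical paths explored under one template-level intent.
template_intent = "contrast between expectation and reality"
paths = [
    ReasoningPath("office life", "Me after one meeting", 0.6),
    ReasoningPath("exam season", "Studying vs. the actual exam", 0.8),
    ReasoningPath("cooking", "The recipe vs. my plate", 0.7),
]
anchor = anchor_path(paths)
average_score = sum(p.score for p in paths) / len(paths)
```

In this toy setup the anchored path can never score below the average of the explored paths, which gives some intuition for why exploring several paths and then anchoring need not lower quality.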
Peer Reviews
Decision: ICLR 2026 Conference Desk Rejected Submission
Really good framing of the problem - a hard problem to tackle (subjective, multimodal) but a good formulation. Breaking out the chain of thought to first reason about the template and then about the grounding / text is a great way to approach it, similar to how a human would go about making a meme. I like that the authors headed off questions about necessity of using group-wise ranking by showing that absolute VLM scoring is not as aligned with human ranking as group-wise. Overall - I think taki
There aren't as many weaknesses that I caught in this paper. Theoretically, it makes sense that using GRPO would limit the deviation from the supervised policy, but I wonder how much it ensures the improvement in the generated outputs. One thing I would note is that a lot of the structures used to improve alignment to human preferences are engineered for this particular scenario of maximizing humor in the generated image. It would be great to show how this could be generalized for other use ca
- Novel problem formulation: The paper addresses the challenging task of meme generation as a group-wise reasoning problem, acknowledging that humor comparability is more reliable within meme templates than across them.
- Theoretically grounded approach: The framework provides theoretical guarantees including conditional humor lower bounds (Proposition 1), rank consistency (Proposition 2), and bounded degradation under KL control (Proposition 4).
- Comprehensive evaluation: The paper includes bo
- Missing critical citations: The paper lacks important references in humor understanding using LLMs and alignment of subjective humor preferences. Notable omissions include work by Hessel et al. (2023), Zhang et al. (2024), Zhou et al. (2025), Kazemi et al. (2025), Liang et al. (2025), Binsted et al. (2006), and Apte et al. The claim about CoT improving VLM reasoning also needs citation support.
- Insufficient human evaluation details: The paper provides no information about human annotators -
- The paper provides a solid theoretical foundation and formal modeling for the group-wise meme understanding problem.
- The experiments explicitly optimize the reasoning process by finetuning with CoT data and use reinforcement learning to align the model’s outputs with human humor distributions.
- The Hierarchical CoT framework, while conceptually rich, still depends heavily on VLMs to extract the template intent. If the intent extraction is incorrect (e.g., the model misinterprets the scene or theme), all subsequent reasoning chains may deviate.
- The paper lacks a detailed statistical analysis of the dataset, such as the content composition, image complexity or cultural diversity of memes.
- The proposed method is only evaluated on single-panel meme images; it may not generalize to mu
1. Clear and well-motivated problem formulation. The paper compellingly reframes meme generation as a group-wise, open-ended reasoning problem with formal notation, explicitly addressing that humor is subjective and cross-template comparisons are unreliable.
2. Comprehensive theoretical analysis with rigorous proofs. Four propositions cover humor quality preservation under multi-path CoT, ranking consistency and noise robustness of pairwise rewards, and KL-constrained optimization guarantees, w
1. Insufficient novelty in methodology beyond multi-path CoT and group-wise modeling. While the hierarchical CoT and group-wise preference modeling are reasonable contributions, the subsequent stages (SFT, preference modeling, GRPO) are direct applications of existing mature methods without task-specific innovations. The paper reads more as an engineering combination rather than a methodological advancement, and the theoretical analysis of SFT and GRPO largely restates results already well-estab
Taxonomy
Topics: Humor Studies and Applications · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis
