EventLens: Leveraging Event-Aware Pretraining and Cross-modal Linking Enhances Visual Commonsense Reasoning
Mingjie Ma, Zhihuan Yu, Yichao Ma, Guohui Li

TL;DR
EventLens enhances visual commonsense reasoning by combining event-aware pretraining, cross-modal linking, and prompt strategies, enabling LLMs to better understand and reason about complex visual scenarios with fine-grained alignment.
Contribution
The paper introduces EventLens, a novel framework that activates LLMs' reasoning through event-aware pretraining and improves fine-grained multimodal alignment for VCR.
Findings
EventLens outperforms existing models on VCR benchmarks.
The auxiliary event-aware pretraining improves reasoning capabilities.
Fine-grained linking enhances understanding of image-text co-reference.
Abstract
Visual Commonsense Reasoning (VCR) is a cognitive task that challenges models to answer visual questions requiring human commonsense and to provide rationales explaining why the answers are correct. With the emergence of Large Language Models (LLMs), it is natural and imperative to explore their applicability to VCR. However, the VCR task demands more external knowledge to tackle its challenging questions, necessitating special designs to activate LLMs' commonsense reasoning abilities. Also, most existing Multimodal LLMs adopt an abstraction of the entire input image, which makes it difficult to comprehend VCR's unique co-reference tags between image regions and text, posing challenges for fine-grained alignment. To address these issues, we propose EventLens that leverages Event-Aware Pretraining and Cross-modal Linking and EnhanceS VCR. First, by emulating the cognitive process of human reasoning,…
Taxonomy
Topics: Multimodal Machine Learning Applications · Data Visualization and Analytics · Human Pose and Action Recognition
