Multi-modal Large Language Model Enhanced Pseudo 3D Perception Framework for Visual Commonsense Reasoning
Jian Zhu, Hanli Wang, Miaojing Shi

TL;DR
This paper introduces a novel multi-modal large language model framework that incorporates pseudo 3D object perception and depth-aware reasoning to improve visual commonsense reasoning accuracy.
Contribution
It proposes integrating object depth into VCR models through a depth-aware Transformer and depth-tagged answer words, giving multi-modal reasoning a 3D spatial understanding of the scene.
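As a rough illustration of what "depth-aware" attention could mean, the sketch below subtracts a penalty proportional to the depth gap between two objects from their attention logit before the softmax. This is a hypothetical formulation for intuition only, not the paper's mechanism, which this summary does not specify. The function name `depth_biased_attention`, the `penalty` parameter, and the additive-bias form are all assumptions.

```python
import math

def depth_biased_attention(scores, depths, penalty=1.0):
    """Softmax over attention scores after subtracting a depth-gap penalty.

    scores[i][j]: raw similarity between object i and object j (hypothetical input).
    depths[i]:    estimated depth of object i (e.g. from a depth estimator).
    penalty:      assumed scalar controlling how strongly depth gaps reduce attention.
    """
    n = len(depths)
    weights = []
    for i in range(n):
        # Assumed additive bias: objects far apart in depth get a lower logit.
        logits = [scores[i][j] - penalty * abs(depths[i] - depths[j]) for j in range(n)]
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Three objects with equal raw scores: two near the camera, one far away.
w = depth_biased_attention([[0.0] * 3 for _ in range(3)], depths=[1.0, 1.2, 5.0])
```

With equal raw scores, each object attends most to itself and to objects at a similar depth, so the two nearby objects weight each other more heavily than either weights the distant one.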
Findings
Outperforms state-of-the-art methods on the VCR dataset
Effectively models 3D relations between objects in images
Improves reasoning accuracy through depth-aware mechanisms
Abstract
The visual commonsense reasoning (VCR) task is to choose an answer and provide a justifying rationale based on the given image and textual question. Representative works first recognize objects in images and then associate them with keywords in texts. However, existing approaches do not consider exact positions of objects in a human-like three-dimensional (3D) manner, making them unable to accurately distinguish objects and understand visual relations. Recently, multi-modal large language models (MLLMs) have been used as powerful tools for several multi-modal tasks but not yet for VCR, which requires elaborate reasoning on specific visual objects referred to by texts. In light of the above, an MLLM enhanced pseudo 3D perception framework is designed for VCR. Specifically, we first demonstrate that the relation between objects is relevant to object depths in images, and hence…
Taxonomy
Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Advanced Image and Video Retrieval Techniques
Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Layer Normalization · Label Smoothing · Adam · Multi-Head Attention · Residual Connection · Dense Connections · Position-Wise Feed-Forward Layer
