Multi-modal Large Language Model Enhanced Pseudo 3D Perception Framework for Visual Commonsense Reasoning

Jian Zhu; Hanli Wang; Miaojing Shi

arXiv:2301.13335·cs.CV·December 27, 2023

Open Access

TL;DR

This paper introduces a novel multi-modal large language model framework that incorporates pseudo 3D object perception and depth-aware reasoning to improve visual commonsense reasoning accuracy.

Contribution

It proposes integrating object depth into VCR models, together with a depth-aware Transformer and depth-tagged answer words, to enhance multi-modal reasoning with 3D spatial understanding.

Findings

01. Outperforms state-of-the-art methods on the VCR dataset

02. Effectively models 3D object relations in images

03. Improves reasoning accuracy through depth-aware mechanisms

Abstract

The visual commonsense reasoning (VCR) task is to choose an answer and provide a justifying rationale based on a given image and textual question. Representative works first recognize objects in images and then associate them with keywords in the text. However, existing approaches do not consider the exact positions of objects in a human-like three-dimensional (3D) manner, making them unable to accurately distinguish objects and understand visual relations. Recently, multi-modal large language models (MLLMs) have been used as powerful tools for several multi-modal tasks but not yet for VCR, which requires elaborate reasoning about specific visual objects referred to by the text. In light of the above, an MLLM enhanced pseudo 3D perception framework is designed for VCR. Specifically, we first demonstrate that the relation between objects is relevant to object depths in images, and hence…

Taxonomy

Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Advanced Image and Video Retrieval Techniques

Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Layer Normalization · Label Smoothing · Adam · Multi-Head Attention · Residual Connection · Dense Connections · Position-Wise Feed-Forward Layer