LLaVA-SG: Leveraging Scene Graphs as Visual Semantic Expression in Vision-Language Models
Jingyi Wang, Jianzhong Ju, Jian Luan, Zhidong Deng

TL;DR
This paper introduces a Scene Graph Expression (SGE) module for vision-language models, enhancing their ability to understand complex semantic information in images by structurally representing scene graphs, leading to improved performance.
Contribution
The paper presents a novel SGE module that structurally encodes semantic information, addressing the fragmented perception issue in ViT-based VLMs, and improves visual understanding.
Findings
SGE module significantly improves VLM performance
Enhances understanding of complex semantic details
Facilitates better visual perception in models
Abstract
Recent advances in large vision-language models (VLMs) typically employ vision encoders based on the Vision Transformer (ViT) architecture. The division of the images into patches by ViT results in a fragmented perception, thereby hindering the visual understanding capabilities of VLMs. In this paper, we propose an innovative enhancement to address this limitation by introducing a Scene Graph Expression (SGE) module in VLMs. This module extracts and structurally expresses the complex semantic information within images, thereby improving the foundational perception and understanding abilities of VLMs. Extensive experiments demonstrate that integrating our SGE module significantly enhances the VLM's performance in vision-language tasks, indicating its effectiveness in preserving intricate semantic details and facilitating better visual understanding.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
MethodsAttention Is All You Need · Linear Layer · Adam · Layer Normalization · Position-Wise Feed-Forward Layer · Dense Connections · Residual Connection · Multi-Head Attention · Byte Pair Encoding · Absolute Position Encodings
