LLaVA-SG: Leveraging Scene Graphs as Visual Semantic Expression in Vision-Language Models

Jingyi Wang; Jianzhong Ju; Jian Luan; Zhidong Deng

arXiv:2408.16224 · cs.CV · September 2, 2024

Open Access

TL;DR

This paper introduces a Scene Graph Expression (SGE) module for vision-language models (VLMs). The module extracts the semantic information in an image and expresses it structurally as a scene graph, which improves the model's understanding of complex image content and its performance on vision-language tasks.

Contribution

The paper presents a novel SGE module that structurally encodes the semantic information in an image. It targets the fragmented perception that arises when a ViT-based vision encoder divides the image into patches, and it improves the visual understanding of the VLM.

Findings

01. SGE module significantly improves VLM performance

02. Enhances understanding of complex semantic details

03. Facilitates better visual perception in models

Abstract

Recent advances in large vision-language models (VLMs) typically employ vision encoders based on the Vision Transformer (ViT) architecture. The division of the images into patches by ViT results in a fragmented perception, thereby hindering the visual understanding capabilities of VLMs. In this paper, we propose an innovative enhancement to address this limitation by introducing a Scene Graph Expression (SGE) module in VLMs. This module extracts and structurally expresses the complex semantic information within images, thereby improving the foundational perception and understanding abilities of VLMs. Extensive experiments demonstrate that integrating our SGE module significantly enhances the VLM's performance in vision-language tasks, indicating its effectiveness in preserving intricate semantic details and facilitating better visual understanding.
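
The two ideas in the abstract, patch-level fragmentation and a structured scene graph, can be illustrated with a minimal Python sketch. This is not the paper's SGE module. The class names, the `patches_covering` helper, the bounding boxes, and the patch size of 14 pixels are all hypothetical choices made for illustration only. The sketch shows how one object can span many ViT patches, while a scene graph keeps that object as a single node linked to others by relations.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SceneObject:
    name: str
    box: tuple  # (x0, y0, x1, y1) in pixels; hypothetical annotation


@dataclass
class SceneGraph:
    objects: list = field(default_factory=list)
    # Each relation is (subject_index, predicate, object_index).
    relations: list = field(default_factory=list)

    def add_object(self, name, box):
        """Add a node and return its index."""
        self.objects.append(SceneObject(name, box))
        return len(self.objects) - 1

    def add_relation(self, subj, predicate, obj):
        """Add a directed, labeled edge between two existing nodes."""
        self.relations.append((subj, predicate, obj))

    def triples(self):
        """Return relations as (subject name, predicate, object name)."""
        return [
            (self.objects[s].name, p, self.objects[o].name)
            for s, p, o in self.relations
        ]


def patches_covering(box, patch_size=14):
    """Return (row, col) indices of the square patches a box overlaps.

    patch_size=14 is an assumed value, not taken from the paper.
    """
    x0, y0, x1, y1 = box
    cols = range(x0 // patch_size, (x1 - 1) // patch_size + 1)
    rows = range(y0 // patch_size, (y1 - 1) // patch_size + 1)
    return [(r, c) for r in rows for c in cols]


# Hypothetical example image: a man riding a horse.
graph = SceneGraph()
man = graph.add_object("man", (28, 14, 84, 126))
horse = graph.add_object("horse", (70, 56, 196, 168))
graph.add_relation(man, "riding", horse)

# The "man" region is spread over many patches in a patch grid ...
man_patches = patches_covering(graph.objects[man].box)
print(len(man_patches))

# ... but stays one node with an explicit relation in the scene graph.
print(graph.triples())
```

In this toy example the single "man" box overlaps dozens of patch cells, while the scene graph holds it as one node with one relation. That contrast is the intuition behind expressing image semantics structurally. How the paper actually extracts and encodes the graph is not described in this abstract.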

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques

Methods: Attention Is All You Need · Linear Layer · Adam · Layer Normalization · Position-Wise Feed-Forward Layer · Dense Connections · Residual Connection · Multi-Head Attention · Byte Pair Encoding · Absolute Position Encodings