On Analyzing the Role of Image for Visual-enhanced Relation Extraction

Lei Li; Xiang Chen; Shuofei Qiao; Feiyu Xiong; Huajun Chen; Ningyu Zhang

arXiv:2211.07504·cs.CL·November 15, 2022·5 cites

TL;DR

This paper analyzes the impact of visual information quality on multimodal relation extraction and proposes a Transformer-based method with implicit fine-grained alignment, showing improved performance.

Contribution

It provides an empirical analysis of visual scene graph inaccuracies and introduces a novel Transformer-based baseline with implicit alignment for better multimodal relation extraction.
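As a rough illustration of what implicit alignment can mean in a Transformer, the sketch below runs scaled dot-product cross-attention with text tokens as queries and visual tokens as keys and values. It is an assumption-laden toy, not the authors' architecture: it omits learned projections, multiple heads, and layer stacking, and the names `cross_attention`, `text_tokens`, and `visual_tokens` are hypothetical. The point is only that the attention weights act as a soft token-to-region alignment learned end to end, with no explicit scene graph.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(
    text_tokens: List[Vector], visual_tokens: List[Vector]
) -> Tuple[List[Vector], List[Vector]]:
    """Let each text token attend over all visual tokens.

    Returns (weights, fused): weights[i][j] is how strongly text token i
    attends to visual token j; fused[i] is the weighted sum of visual tokens.
    Assumption: no learned query/key/value projections, for brevity.
    """
    dim = len(visual_tokens[0])
    weights: List[Vector] = []
    fused: List[Vector] = []
    for query in text_tokens:
        # Scaled dot-product score between this text token and each visual token.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in visual_tokens
        ]
        row = softmax(scores)
        weights.append(row)
        # Weighted sum of visual tokens gives the visually enhanced representation.
        fused.append(
            [sum(w * value[i] for w, value in zip(row, visual_tokens)) for i in range(dim)]
        )
    return weights, fused
```

Each row of `weights` sums to 1, so it can be read as a distribution over visual tokens for the corresponding text token.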

Findings

01 Inaccurate visual scene graphs degrade modal alignment and performance.

02 Current methods may not fully utilize visual information.

03 The proposed method outperforms existing approaches.

Abstract

Multimodal relation extraction is an essential task for knowledge graph construction. In this paper, we conduct an in-depth empirical analysis indicating that inaccurate information in the visual scene graph leads to poor modal alignment weights, which further degrades performance. Moreover, the visual shuffle experiments illustrate that current approaches may not take full advantage of visual information. Based on these observations, we further propose a strong baseline with implicit fine-grained multimodal alignment based on Transformer for multimodal relation extraction. Experimental results demonstrate the better performance of our method. Code is available at https://github.com/zjunlp/DeepKE/tree/main/example/re/multimodal.
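The visual shuffle experiment mentioned in the abstract can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' protocol: the `Example` tuple layout, the `predict` callable, and the helper names are hypothetical, and accuracy stands in for whatever metric the paper reports. The idea is to reassign images across examples while keeping each text and label fixed, then compare scores. A gap near zero suggests the model is barely using the image.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical example layout: (text, image_id, gold_relation).
Example = Tuple[str, str, str]
Predictor = Callable[[str, str], str]


def shuffle_images(examples: Sequence[Example], seed: int = 0) -> List[Example]:
    """Randomly reassign images across examples; texts and labels stay in place."""
    rng = random.Random(seed)
    image_ids = [image_id for _, image_id, _ in examples]
    rng.shuffle(image_ids)
    return [
        (text, image_id, relation)
        for (text, _, relation), image_id in zip(examples, image_ids)
    ]


def accuracy(predict: Predictor, examples: Sequence[Example]) -> float:
    """Fraction of examples whose predicted relation matches the gold label."""
    correct = sum(predict(text, image_id) == relation for text, image_id, relation in examples)
    return correct / len(examples)


def shuffle_gap(predict: Predictor, examples: Sequence[Example], seed: int = 0) -> float:
    """Score on original pairs minus score on shuffled pairs."""
    return accuracy(predict, examples) - accuracy(predict, shuffle_images(examples, seed))
```

In practice one would average the gap over several seeds, since a single random permutation can leave some images in their original positions.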

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Label Smoothing · Layer Normalization · Softmax · Adam · Absolute Position Encodings · Byte Pair Encoding