On Analyzing the Role of Image for Visual-enhanced Relation Extraction
Lei Li, Xiang Chen, Shuofei Qiao, Feiyu Xiong, Huajun Chen, Ningyu, Zhang

TL;DR
This paper analyzes the impact of visual information quality on multimodal relation extraction and proposes a Transformer-based method with implicit fine-grained alignment, showing improved performance.
Contribution
It provides an empirical analysis of visual scene graph inaccuracies and introduces a novel Transformer-based baseline with implicit alignment for better multimodal relation extraction.
Findings
Inaccurate visual scene graphs degrade modal alignment and performance.
Current methods do not fully utilize visual information.
Proposed method outperforms existing approaches.
Abstract
Multimodal relation extraction is an essential task for knowledge graph construction. In this paper, we take an in-depth empirical analysis that indicates the inaccurate information in the visual scene graph leads to poor modal alignment weights, further degrading performance. Moreover, the visual shuffle experiments illustrate that the current approaches may not take full advantage of visual information. Based on the above observation, we further propose a strong baseline with an implicit fine-grained multimodal alignment based on Transformer for multimodal relation extraction. Experimental results demonstrate the better performance of our method. Codes are available at https://github.com/zjunlp/DeepKE/tree/main/example/re/multimodal.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
MethodsMulti-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Label Smoothing · Layer Normalization · Softmax · Adam · Absolute Position Encodings · Byte Pair Encoding
